Update: A "reference implementation" is now available as a perl module. The tables included cover 57 encodings. You can download the module and view its documentation at CPAN: Encode-HEBCI-0.01.
HEBCI is a technique that allows a web form handler to transparently detect the character set its data was encoded with. By using carefully-chosen character references, the browser's encoding can be inferred. Thus, it is possible to guarantee that data is in a standard encoding without relying on (often unreliable) webserver/browser encoding interactions.
If you're accepting input from a user, you need to know what encoding was used to submit the form. When John submits a blog post or Jane writes to her friend using webmail, you must know what format they're using in order to reliably use the information later. This is where the real problem begins: most browsers don't let you actively find out what codepage was used to submit the form; they just give you the variable/value pairs, and that's that. In Internet Explorer, it's possible to use an undocumented DOM element to access the current codepage, but that won't work for everyone, and it raises potential issues with JavaScript permissions. Therefore, we need to find an alternate method.
The ideal solution will be entirely browser-neutral and passive. Unfortunately, the HTML spec doesn't define any mechanism for this. We need to find some other, sneakier, way to extract the current character encoding from the browser.
Luckily for us, there is a trick we can use for this: entity codes. Entity codes are strings like &amp;, which were (are) used to encode specific characters without using Unicode. When the browser displays a page, it replaces these with the appropriate character from the current encoding. Thus, &amp; becomes the character 0x26 in most codepages. By itself, this is merely implementation trivia. However, this translation process occurs whenever a user submits a form. That is, the browser parses any entities in the form variables and replaces them with the current encoding's representation of those characters when the user clicks submit. Thus, any entity codes within the form fields are passed along as character values in the browser's current encoding.
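To make that concrete, here is a minimal sketch of what the handler side sees; the field name and the %form hash are placeholders for however your framework delivers raw parameters, not part of the technique itself:

```perl
# Minimal sketch: hex-dump whatever bytes the browser actually sent for a
# probe field.  %form and "hebci_probe" are stand-ins for your framework.
my %form  = ( hebci_probe => "\x26" );        # what arrives for value="&amp;"
my $bytes = unpack 'H*', $form{hebci_probe};  # hex string of the raw bytes
print "probe came back as: $bytes\n";         # "26" in any ASCII-compatible codepage
```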
So, all we have to do is find an entity that is encoded differently in two different codepages. We slip that into a form field, and then look at its value when the data comes back; this lets us differentiate between the two encodings. In fact, we can examine how every entity is encoded across many codepages and pick out the ones that disambiguate between as many of them as possible. This is what I've done.
We add hidden form elements with values containing various entity codes, such as &deg;, &divide;, and &mdash;. Then, when the user submits the form, we take each of those and compare them against a list of what character has what value in what codepage. That is, each codepage has a unique fingerprint for the values of &deg;, &divide;, and &mdash;. For MacRoman, it's a1, d6, d1; for UTF-8, it's c2b0, c3b7, e28094.
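On the HTML side, the probes are nothing more exotic than hidden inputs carrying those entities. A sketch of what the markup might look like (the field names are invented for illustration; the real demo's names may differ):

```perl
# Sketch of the form-side markup: one hidden probe field per entity.
print <<'HTML';
<form method="post" action="/handler">
  <input type="hidden" name="hebci_deg"    value="&deg;">
  <input type="hidden" name="hebci_divide" value="&divide;">
  <input type="hidden" name="hebci_mdash"  value="&mdash;">
  <!-- ... the user's real fields go here ... -->
  <input type="submit">
</form>
HTML
```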
Thus, we only have to go through our table of codepage-to-fingerprint mappings and see which fingerprint matches.
Note that, once this table is discovered, the cost of fingerprinting a given form submission is very low. And, in the case of misses, you can assume whatever your page's default codepage is. This fallthrough case is equivalent to what the code would have done before adding this detection layer.
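The lookup itself can be a straight linear scan over the table, falling back to the page default when nothing matches. A sketch, using just the MacRoman and UTF-8 fingerprints from above and a made-up default:

```perl
# Sketch: match a submitted fingerprint (one hex string per probe entity)
# against known codepages; fall through to the page default on a miss.
my %fp_table = (
    'MacRoman' => [ 'a1',   'd6',   'd1'     ],   # &deg;, &divide;, &mdash;
    'UTF-8'    => [ 'c2b0', 'c3b7', 'e28094' ],
);

sub detect_codepage {
    my ($default, @submitted) = @_;
    CODEPAGE: for my $cp (keys %fp_table) {
        my $expected = $fp_table{$cp};
        for my $i (0 .. $#$expected) {
            next CODEPAGE unless $submitted[$i] eq $expected->[$i];
        }
        return $cp;        # every probe matched
    }
    return $default;       # miss: behave as if no detection happened
}

print detect_codepage('ISO-8859-1', 'c2b0', 'c3b7', 'e28094'), "\n";   # UTF-8
```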
Surprisingly, one can distinguish between the Big Three (ISO-8859-1/Windows-1252, MacRoman, and UTF-8) with a single entity: &ordm;.
| Codepage | &ordm; (hex bytes) |
|---|---|
| UTF-8 | c2ba |
| ISO-8859-1 | ba |
| MacRoman | bc |
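If those three encodings are all you need to tell apart, the whole detector collapses to a single hash lookup on the bytes received for &ordm;. A tiny sketch, with the submitted value hard-coded and an ISO-8859-1 fallback chosen purely for illustration:

```perl
# Sketch: distinguish the Big Three from the bytes received for &ordm; alone.
my %by_ordm = (
    'c2ba' => 'UTF-8',
    'ba'   => 'ISO-8859-1',      # same byte as Windows-1252
    'bc'   => 'MacRoman',
);
my $received = "\xc2\xba";                                            # stand-in for the probe value
my $codepage = $by_ordm{ unpack 'H*', $received } // 'ISO-8859-1';    # miss => page default
print "$codepage\n";                                                  # UTF-8
```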
Differentiating a larger set of encodings requires more entities. The current test implementation works well with only five:
```perl
my @fp_ents = qw/deg divide mdash bdquo euro/;

my %fingerprints = (
    "UTF-8"        => ['c2b0','c3b7','e28094','e2809e','e282ac'],
    "WINDOWS-1252" => ['b0','f7','97','84','80'],
    "MAC"          => ['a1','d6','d1','e3','db'],
    "MS-HEBR"      => ['b0','ba','97','84','80'],
    "MAC-CYRILLIC" => ['a1','d6','d1','d7',''],
    "MS-GREEK"     => ['b0','','97','84','80'],
    "MAC-IS"       => ['a1','d6','d0','e3',''],
    "MS-CYRL"      => ['b0','','97','84','88'],
    "MS932"        => ['818b','8180','815c','',''],
    "WINDOWS-31J"  => ['818b','8180','815c','',''],
    "WINDOWS-936"  => ['a1e3','a1c2','a1aa','',''],
    "MS_KANJI"     => ['818b','8180','','',''],
    "ISO-8859-15"  => ['b0','f7','','','a4'],
    "ISO-8859-1"   => ['b0','f7','','',''],
    "CSIBM864"     => ['80','dd','','',''],
);
```
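Here is one way the table above might be consumed on the handler side. It assumes one hidden field per entity, named after the entity, and treats an empty string in a fingerprint as "this codepage has no entry for that probe"; both are assumptions made for illustration, not a description of the actual Encode::HEBCI internals:

```perl
# Sketch: build the submitted fingerprint from the five probe fields and
# scan %fingerprints (defined above) for a match.
sub match_fingerprint {
    my ($params, $default) = @_;
    my @got = map { unpack 'H*', ($params->{$_} // '') } @fp_ents;

    # Try the most specific fingerprints (fewest blank entries) first, so a
    # Windows-1252 submission isn't claimed by the shorter ISO-8859-1 row.
    my @cps = sort {
        grep(length, @{ $fingerprints{$b} }) <=> grep(length, @{ $fingerprints{$a} })
    } keys %fingerprints;

    CP: for my $cp (@cps) {
        my $want = $fingerprints{$cp};
        for my $i (0 .. $#fp_ents) {
            next unless length $want->[$i];        # blank entry: nothing to compare
            next CP unless $got[$i] eq $want->[$i];
        }
        return $cp;
    }
    return $default;                               # no match: assume the page default
}

# A MacRoman submission, for example:
my %params = ( deg   => "\xa1", divide => "\xd6", mdash => "\xd1",
               bdquo => "\xe3", euro   => "\xdb" );
print match_fingerprint(\%params, 'ISO-8859-1'), "\n";    # prints "MAC"
```

Ordering by specificity is just one way to keep a full fingerprint like WINDOWS-1252 from being shadowed by a shorter row that happens to agree on its first couple of probes.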
A demo application is available. Note that you'll need to click the button to submit an initial fingerprint. After submitting the form with your default encoding, change your browser's encoding to something else in the list above and try it again; the result should update to reflect the new codepage.
It's also worthwhile to view the source to see just how simple this is from the HTML side. With minor additions like these to a form, it becomes possible to verify the encoding of submitted data, allowing web developers to guarantee normalization and smooth interoperability with other, pickier protocols.
How was the fingerprint table generated? This is kind of complicated, but I'll sketch it out for now: each probe character is converted into every candidate codepage with iconv(1), and the resulting byte sequences are recorded as that codepage's fingerprint.
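I don't have the exact generation script, but the core of the method can be sketched with Perl's Encode module standing in for iconv(1); the encoding names below are illustrative, and an empty entry marks a character the codepage can't express:

```perl
# Sketch of fingerprint generation, using Encode in place of iconv(1).
use strict;
use warnings;
use Encode qw(encode);

my %ents = ( deg   => "\x{00B0}", divide => "\x{00F7}", mdash => "\x{2014}",
             bdquo => "\x{201E}", euro   => "\x{20AC}" );

for my $enc (qw(UTF-8 cp1252 MacRoman ISO-8859-15)) {
    my @fp;
    for my $name (qw(deg divide mdash bdquo euro)) {
        my $char  = $ents{$name};   # copy, since encode() may modify its argument
        # Croak (and record '') when the codepage has no mapping for the character.
        my $bytes = eval { encode($enc, $char, Encode::FB_CROAK) };
        push @fp, defined $bytes ? unpack('H*', $bytes) : '';
    }
    printf "%-12s => [%s]\n", $enc, join(',', map { "'$_'" } @fp);
}
```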
On my test machine, I had access to around 950 codepages, and tried running the above fingerprint-generation method on them. This ran into several problems:
Each of these three obstacles can be overcome, and I believe that a truly universal fingerprint mechanism could be developed. The more practical technique for now is to shoot for the top 25 or 50 codepages, as they will likely cover the majority of users.