Update: A "reference implementation" is now available as a perl module. The tables included cover 57 encodings. You can download the module and view its documentation at CPAN: Encode-HEBCI-0.01.
HEBCI is a technique that allows a web form handler to transparently detect the character set its data was encoded with. By using carefully-chosen character references, the browser's encoding can be inferred. Thus, it is possible to guarantee that data is in a standard encoding without relying on (often unreliable) webserver/browser encoding interactions.
If you're accepting input from a user, you need to know what encoding was used to submit the form. When John submits a blog post or Jane writes to her friend using webmail, you must know what format they're using in order to reliably use the information later. This is where the real problem begins: most browsers don't let you actively find out what codepage was used to submit the form; they just give you the variable/value pairs, and that's that. In Internet Explorer, it's possible to use an undocumented DOM element to access the current codepage, but that won't work for everyone, and it raises potential issues with JavaScript permissions. Therefore, we need to find an alternate method.
The ideal solution will be entirely browser-neutral and passive. Unfortunately, the HTML spec doesn't define any mechanism for this. We need to find some other, sneakier, way to extract the current character encoding from the browser.
Luckily for us, there is a trick we can use for this: entity codes. Entity codes are strings like &amp;, which were (are) used to encode specific characters without using Unicode. When the browser displays a page, it replaces these with the appropriate character from the current encoding. Thus, &amp; becomes the character 0x26 in most codepages. By itself, this is merely implementation trivia. However, this translation process occurs whenever a user submits a form. That is, the browser parses any entities in the form variables and replaces them with the current encoding's representation of those characters when the user clicks submit. Thus, any entity codes within the form fields are passed along as character values in the browser's current encoding.
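To make that concrete, here is a minimal sketch of what the handler side sees; the field name and the %form hash are placeholders for however your framework delivers raw parameters, not part of the technique itself:

```perl
# Minimal sketch: hex-dump whatever bytes the browser actually sent for a
# probe field.  %form and "hebci_probe" are stand-ins for your framework.
my %form  = ( hebci_probe => "\x26" );        # what arrives for value="&amp;"
my $bytes = unpack 'H*', $form{hebci_probe};  # hex string of the raw bytes
print "probe came back as: $bytes\n";         # "26" in any ASCII-compatible codepage
```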
So, all we have to do is find an entity that is encoded differently in two different codepages. We slip that into a form field, and then look at its value when the data comes back; this lets us differentiate between the two encodings. In fact, we can examine how every entity is encoded across many codepages and pick out the ones that disambiguate between as many of them as possible. This is what I've done.
We add hidden form elements with values containing various entity codes, such as &deg;, &divide;, and &mdash;. Then, when the user submits the form, we take each of those and compare them against a list of what character has what value in what codepage. That is, each codepage has a unique fingerprint for the values of &deg;, &divide;, and &mdash;. For MacRoman, it's a1, d6, d1; for UTF-8, it's c2b0, c3b7, e28094.
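On the HTML side, the probes are nothing more exotic than hidden inputs carrying those entities. A sketch of what the markup might look like (the field names are invented for illustration; the real demo's names may differ):

```perl
# Sketch of the form-side markup: one hidden probe field per entity.
print <<'HTML';
<form method="post" action="/handler">
  <input type="hidden" name="hebci_deg"    value="&deg;">
  <input type="hidden" name="hebci_divide" value="&divide;">
  <input type="hidden" name="hebci_mdash"  value="&mdash;">
  <!-- ... the user's real fields go here ... -->
  <input type="submit">
</form>
HTML
```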
Thus, we only have to go through our table of codepage-to-fingerprint mappings and see which fingerprint matches.
Note that, once this table is discovered, the cost of fingerprinting a given form submission is very low. And, in the case of misses, you can assume whatever your page's default codepage is. This fallthrough case is equivalent to what the code would have done before adding this detection layer.
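The lookup itself can be a straight linear scan over the table, falling back to the page default when nothing matches. A sketch, using just the MacRoman and UTF-8 fingerprints from above and a made-up default:

```perl
# Sketch: match a submitted fingerprint (one hex string per probe entity)
# against known codepages; fall through to the page default on a miss.
my %fp_table = (
    'MacRoman' => [ 'a1',   'd6',   'd1'     ],   # &deg;, &divide;, &mdash;
    'UTF-8'    => [ 'c2b0', 'c3b7', 'e28094' ],
);

sub detect_codepage {
    my ($default, @submitted) = @_;
    CODEPAGE: for my $cp (keys %fp_table) {
        my $expected = $fp_table{$cp};
        for my $i (0 .. $#$expected) {
            next CODEPAGE unless $submitted[$i] eq $expected->[$i];
        }
        return $cp;        # every probe matched
    }
    return $default;       # miss: behave as if no detection happened
}

print detect_codepage('ISO-8859-1', 'c2b0', 'c3b7', 'e28094'), "\n";   # UTF-8
```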
Surprisingly, one can distinguish between the Big Three (ISO-8859-1/Windows-1252, MacRoman, and UTF-8) with a single entity: &ordm;.
| Codepage | &ordm; (hex bytes) |
|---|---|
| UTF-8 | c2ba |
| ISO-8859-1 | ba |
| MacRoman | bc |
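If those three encodings are all you need to tell apart, the whole detector collapses to a single hash lookup on the bytes received for &ordm;. A tiny sketch, with the submitted value hard-coded and an ISO-8859-1 fallback chosen purely for illustration:

```perl
# Sketch: distinguish the Big Three from the bytes received for &ordm; alone.
my %by_ordm = (
    'c2ba' => 'UTF-8',
    'ba'   => 'ISO-8859-1',      # same byte as Windows-1252
    'bc'   => 'MacRoman',
);
my $received = "\xc2\xba";                                            # stand-in for the probe value
my $codepage = $by_ordm{ unpack 'H*', $received } // 'ISO-8859-1';    # miss => page default
print "$codepage\n";                                                  # UTF-8
```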
Differentiating a larger set of encodings requires more entities. The current test implementation works well with only five:
```perl
my @fp_ents = qw/deg divide mdash bdquo euro/;

my %fingerprints = (
    "UTF-8"        => ['c2b0','c3b7','e28094','e2809e','e282ac'],
    "WINDOWS-1252" => ['b0','f7','97','84','80'],
    "MAC"          => ['a1','d6','d1','e3','db'],
    "MS-HEBR"      => ['b0','ba','97','84','80'],
    "MAC-CYRILLIC" => ['a1','d6','d1','d7',''],
    "MS-GREEK"     => ['b0','','97','84','80'],
    "MAC-IS"       => ['a1','d6','d0','e3',''],
    "MS-CYRL"      => ['b0','','97','84','88'],
    "MS932"        => ['818b','8180','815c','',''],
    "WINDOWS-31J"  => ['818b','8180','815c','',''],
    "WINDOWS-936"  => ['a1e3','a1c2','a1aa','',''],
    "MS_KANJI"     => ['818b','8180','','',''],
    "ISO-8859-15"  => ['b0','f7','','','a4'],
    "ISO-8859-1"   => ['b0','f7','','',''],
    "CSIBM864"     => ['80','dd','','',''],
);
```
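Here is one way the table above might be consumed on the handler side. It assumes one hidden field per entity, named after the entity, and treats an empty string in a fingerprint as "this codepage has no entry for that probe"; both are assumptions made for illustration, not a description of the actual Encode::HEBCI internals:

```perl
# Sketch: build the submitted fingerprint from the five probe fields and
# scan %fingerprints (defined above) for a match.
sub match_fingerprint {
    my ($params, $default) = @_;
    my @got = map { unpack 'H*', ($params->{$_} // '') } @fp_ents;

    # Try the most specific fingerprints (fewest blank entries) first, so a
    # Windows-1252 submission isn't claimed by the shorter ISO-8859-1 row.
    my @cps = sort {
        grep(length, @{ $fingerprints{$b} }) <=> grep(length, @{ $fingerprints{$a} })
    } keys %fingerprints;

    CP: for my $cp (@cps) {
        my $want = $fingerprints{$cp};
        for my $i (0 .. $#fp_ents) {
            next unless length $want->[$i];        # blank entry: nothing to compare
            next CP unless $got[$i] eq $want->[$i];
        }
        return $cp;
    }
    return $default;                               # no match: assume the page default
}

# A MacRoman submission, for example:
my %params = ( deg   => "\xa1", divide => "\xd6", mdash => "\xd1",
               bdquo => "\xe3", euro   => "\xdb" );
print match_fingerprint(\%params, 'ISO-8859-1'), "\n";    # prints "MAC"
```

Ordering by specificity is just one way to keep a full fingerprint like WINDOWS-1252 from being shadowed by a shorter row that happens to agree on its first couple of probes.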
A demo application is available. Note that you'll need to click the button to submit an initial fingerprint. After submitting the form with your default encoding, change your browser's encoding to something else in the list above and try it again; the result should update to reflect the new codepage.
It's also worthwhile to view the source to see just how simple this is from the HTML side. With minor additions like these to a form, it becomes possible to verify the encoding of submitted data, allowing web developers to guarantee normalization and smooth interoperability with other, pickier protocols.
How was the fingerprint table generated? This is kind of complicated, but I'll sketch it out for now: each probe character is converted into every candidate codepage with iconv(1), and the resulting byte sequences are recorded as that codepage's fingerprint.
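I don't have the exact generation script, but the core of the method can be sketched with Perl's Encode module standing in for iconv(1); the encoding names below are illustrative, and an empty entry marks a character the codepage can't express:

```perl
# Sketch of fingerprint generation, using Encode in place of iconv(1).
use strict;
use warnings;
use Encode qw(encode);

my %ents = ( deg   => "\x{00B0}", divide => "\x{00F7}", mdash => "\x{2014}",
             bdquo => "\x{201E}", euro   => "\x{20AC}" );

for my $enc (qw(UTF-8 cp1252 MacRoman ISO-8859-15)) {
    my @fp;
    for my $name (qw(deg divide mdash bdquo euro)) {
        my $char  = $ents{$name};   # copy, since encode() may modify its argument
        # Croak (and record '') when the codepage has no mapping for the character.
        my $bytes = eval { encode($enc, $char, Encode::FB_CROAK) };
        push @fp, defined $bytes ? unpack('H*', $bytes) : '';
    }
    printf "%-12s => [%s]\n", $enc, join(',', map { "'$_'" } @fp);
}
```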
On my test machine, I had access to around 950 codepages, and tried running the above fingerprint-generation method on them. This ran into several problems:
Each of these three obstacles can be overcome, and I believe that a truly universal fingerprint mechanism could be developed. The more practical technique for now is to shoot for the top 25 or 50 codepages, as they will likely cover the majority of users.