There are several occasions when it’s necessary to automatically detect the encoding that’s used by a file: perhaps your program has an “Import” feature that allows the user to open an arbitrary text file, or perhaps you need to read an HTML file and don’t have access to (or can’t trust) the Content-Type HTTP header. (For an introduction to encodings, see The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) by Joel Spolsky.)
On U.S. English Windows, you can usually assume that a text file is encoded in either UTF-8 or Windows-1252, but if you guess wrong, you might get text that looks like this:
�It�s mine,� he said.
or this:
“It’s mine,†he said.
or worse yet, this:
Unhandled System.Text.DecoderFallbackException
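All three failure modes are easy to reproduce if you want to see them for yourself. The sketch below is purely illustrative: it round-trips a string through the wrong decoder, relying on the decoders’ default replacement-character fallback for the first two cases and a strict UTF-8 decoder for the last.

```csharp
using System;
using System.Text;

class MojibakeDemo
{
    static void Main()
    {
        // "It's mine," with curly quotes, as the author intended it.
        string original = "\u201CIt\u2019s mine,\u201D he said.";

        Encoding windows1252 = Encoding.GetEncoding(1252);

        // Saved as Windows-1252, read back as UTF-8: the 0x93/0x92/0x94 quote
        // bytes aren't valid UTF-8, so the decoder substitutes U+FFFD.
        Console.WriteLine(Encoding.UTF8.GetString(windows1252.GetBytes(original)));

        // Saved as UTF-8, read back as Windows-1252: each three-byte UTF-8
        // sequence for a quote character turns into three mojibake characters.
        Console.WriteLine(windows1252.GetString(Encoding.UTF8.GetBytes(original)));

        // The exception appears when the decoder is configured to throw on
        // invalid bytes instead of substituting U+FFFD.
        var strictUtf8 = new UTF8Encoding(false, true); // second arg: throw on invalid bytes
        strictUtf8.GetString(windows1252.GetBytes(original)); // DecoderFallbackException
    }
}
```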
While it’s obviously best to know the encoding used by the input you’re processing, sometimes there’s no way to know it ahead of time. In that case, there are libraries that can guess the encoding, usually through statistical analysis of the bytes or by detecting invalid byte sequences. The Mozilla project has a universal charset detector, and since IE5 Microsoft has shipped MLang, a COM component that exposes code page detection through the IMultiLanguage2.DetectCodepageInIStream method.
The COM interfaces and structures we need are declared as follows (definitions taken from MLang.h in the Windows SDK):
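Roughly, the interop definitions take the following shape. This is only a sketch, not the full NativeMethods.cs: only DetectCodepageInIStream is given a real signature, the earlier methods are placeholder slots so it lands at the right position in the vtable, and the GUIDs and placeholder count are as I recall them from MLang.h, so verify them against the SDK header.

```csharp
using System.Runtime.InteropServices;
using System.Runtime.InteropServices.ComTypes;

// Managed mirror of the DetectEncodingInfo structure from MLang.h.
[StructLayout(LayoutKind.Sequential)]
internal struct DetectEncodingInfo
{
    public uint nLangID;     // primary language of the detected text
    public uint nCodePage;   // the detected code page
    public int nDocPercent;  // percentage of the document in this encoding
    public int nConfidence;  // the detector's confidence in this result
}

// IMultiLanguage2 derives directly from IUnknown; the methods preceding
// DetectCodepageInIStream are declared only as vtable placeholders.
[ComImport]
[Guid("DCCFC164-2B38-11D2-B7EC-00C04F8F5D9A")]
[InterfaceType(ComInterfaceType.InterfaceIsIUnknown)]
internal interface IMultiLanguage2
{
    void Slot01(); void Slot02(); void Slot03(); void Slot04(); void Slot05(); void Slot06();
    void Slot07(); void Slot08(); void Slot09(); void Slot10(); void Slot11(); void Slot12();
    void Slot13(); void Slot14(); void Slot15(); void Slot16(); void Slot17(); void Slot18();

    // HRESULT DetectCodepageInIStream(DWORD dwFlag, DWORD dwPrefWinCodePage,
    //     IStream *pstmIn, DetectEncodingInfo *lpEncoding, INT *pnScores);
    void DetectCodepageInIStream(uint dwFlag, uint dwPrefWinCodePage, IStream pstmIn,
        ref DetectEncodingInfo lpEncoding, ref int pnScores);
}

// The CMultiLanguage coclass; 'new MultiLanguage()' maps to CoCreateInstance.
[ComImport]
[Guid("275C23E2-3747-11D0-9FEA-00AA003F8646")]
internal class MultiLanguage
{
}
```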
With the addition of a helper class to expose a .NET Stream as a COM IStream, we can call MLang as follows:
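Here’s a sketch of the call, built on the declarations above. It assumes the helper class has a ManagedIStream(Stream) constructor, and for simplicity it asks MLang for only a single guess, so one DetectEncodingInfo can be passed by reference instead of marshaling an array; see the linked source for the complete version with error handling.

```csharp
using System;
using System.IO;
using System.Runtime.InteropServices;
using System.Text;

internal static class EncodingDetector
{
    // Asks MLang for its best guess at the encoding of the bytes in 'stream';
    // returns null if no guess was produced (failure HRESULTs will throw).
    public static Encoding DetectEncoding(Stream stream)
    {
        var multiLanguage = (IMultiLanguage2) new MultiLanguage();
        try
        {
            var info = new DetectEncodingInfo();
            int scores = 1; // on input: the size of the results "array" (just one here)

            // ManagedIStream (see ManagedIStream.cs) exposes the .NET Stream as a COM IStream.
            multiLanguage.DetectCodepageInIStream(
                0, // MLDETECTCP_NONE: no hints about the content
                0, // no preferred code page
                new ManagedIStream(stream),
                ref info,
                ref scores);

            // on output, 'scores' is the number of results that were filled in
            return scores > 0 ? Encoding.GetEncoding((int) info.nCodePage) : null;
        }
        finally
        {
            Marshal.ReleaseComObject(multiLanguage);
        }
    }
}
```

With something like that in place, you can open the file, call EncodingDetector.DetectEncoding on the stream, then rewind it and construct a StreamReader with the Encoding that came back.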
Full source code for this post (with additional error handling) is available in StreamUtility.cs, NativeMethods.cs, and ManagedIStream.cs.
Posted by Bradley Grainger on May 13, 2010