Detecting the Character Encoding of a File

There are several occasions when it’s necessary to automatically detect the encoding that’s used by a file: perhaps your program has an “Import” feature that allows the user to open an arbitrary text file, or perhaps you need to read an HTML file and don’t have access to (or can’t trust) the Content-Type HTTP header. (For an introduction to encodings, see The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) by Joel Spolsky.)

On U.S. English Windows, you can usually assume that a text file is encoded with either UTF-8 or Windows-1252, but if you guess wrong, you might get text that looks like this:

�It�s mine,� he said.

or this:

â€œItâ€™s mine,â€ he said.

or worse yet, this:

Unhandled System.Text.DecoderFallbackException
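
To see where each of these three failures comes from, here is a small sketch that reproduces them on purpose. The class and method names are made up for illustration and are not part of this post's source code. The sketch also assumes that code page 1252 is available, which needs the provider registration shown on .NET Core and .NET 5+.

```csharp
using System;
using System.Text;

// Hypothetical demo class (not part of the original post's source code).
static class MojibakeDemo
{
    static MojibakeDemo()
    {
        // Needed on .NET Core and .NET 5+, where code page 1252 is not built in;
        // remove this line on .NET Framework, which supports it directly.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
    }

    // UTF-8 bytes read as Windows-1252: every byte of a multi-byte character
    // is shown as a separate (wrong) character.
    public static string ReadUtf8AsWindows1252(string text)
    {
        byte[] bytes = Encoding.UTF8.GetBytes(text);
        return Encoding.GetEncoding(1252).GetString(bytes);
    }

    // Windows-1252 bytes read as lenient UTF-8: each invalid byte is replaced with U+FFFD.
    public static string ReadWindows1252AsUtf8(string text)
    {
        byte[] bytes = Encoding.GetEncoding(1252).GetBytes(text);
        return Encoding.UTF8.GetString(bytes);
    }

    // Windows-1252 bytes read as strict UTF-8: invalid bytes raise DecoderFallbackException.
    public static string ReadWindows1252AsStrictUtf8(string text)
    {
        byte[] bytes = Encoding.GetEncoding(1252).GetBytes(text);
        return new UTF8Encoding(false, true).GetString(bytes);
    }
}
```

Passing the curly-quoted text “It’s to ReadUtf8AsWindows1252 gives â€œItâ€™s, passing it to ReadWindows1252AsUtf8 gives �It�s, and passing it to ReadWindows1252AsStrictUtf8 throws.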

While it’s obviously best to know the encoding that’s used by the input you’re processing, sometimes there’s no way to know it ahead of time. In that case, there are libraries that can guess the encoding, usually based on statistical analysis of the bytes or detection of invalid byte sequences. The Mozilla project has a universal charset detector, and since IE5 Microsoft has shipped MLang, a COM component that provides code page detection through the IMultiLanguage2.DetectCodepageInIStream method.
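
As a rough illustration of the "invalid byte sequences" approach, the sketch below checks whether a buffer is well-formed UTF-8 by decoding it strictly. This is a far simpler heuristic than whatever MLang or the Mozilla detector do internally, and the class and method names are hypothetical.

```csharp
using System.Text;

// Hypothetical helper (not part of the original post's source code).
static class Utf8Heuristic
{
    // Returns true if every byte sequence in the buffer is valid UTF-8.
    public static bool IsValidUtf8(byte[] bytes)
    {
        // The second argument makes the decoder throw on invalid bytes
        // instead of silently substituting U+FFFD.
        UTF8Encoding strictUtf8 = new UTF8Encoding(false, true);
        try
        {
            strictUtf8.GetString(bytes);
            return true;
        }
        catch (DecoderFallbackException)
        {
            return false;
        }
    }
}
```

Two caveats apply: pure ASCII input is valid in both UTF-8 and Windows-1252, so a true result doesn't prove the file is UTF-8, and a buffer that was cut off in the middle of a multi-byte character is reported as invalid even if the whole file is fine.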

The COM interfaces and structures we need are declared as follows (definitions taken from MLang.h in the Windows SDK):

[ComImport, Guid("275c23e2-3747-11d0-9fea-00aa003f8646"), ClassInterface(ClassInterfaceType.None)]
internal class MultiLanguage
{
}

[ComImport, Guid("DCCFC164-2B38-11d2-B7EC-00C04F8F5D9A"), InterfaceType(ComInterfaceType.InterfaceIsIUnknown)]
interface IMultiLanguage2
{
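    // most methods omitted for brevity; a real declaration must list every preceding
    // method in MLang.h order so that this one lands in the correct vtable slot

    void DetectCodepageInIStream(MultiLanguageDetectCodePage flags, uint dwPrefWinCodePage,
        IStream pstmIn, ref DetectEncodingInfo lpEncoding, ref int pnScores);
}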

[StructLayout(LayoutKind.Sequential)]
struct DetectEncodingInfo
{
    public uint nLangID;
    public uint nCodePage;
    public int nDocPercent;
    public int nConfidence;
}

// values are bit flags that can be combined
[Flags]
enum MultiLanguageDetectCodePage
{
    None = 0,
    SevenBit = 1,
    EightBit = 2,
    Dbcs = 4,
    Html = 8,
    Number = 16,
}

With the addition of a helper class to expose a .NET Stream as a COM IStream, we can call MLang as follows:
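
For readers who don't have the helper class at hand, here is a minimal sketch of what such a wrapper could look like. It is an assumption about the general shape, not the contents of the ManagedIStream.cs file mentioned at the end of this post: it only supports reading, seeking, and Stat, and throws for everything else.

```csharp
using System;
using System.IO;
using System.Runtime.InteropServices;
using System.Runtime.InteropServices.ComTypes;

// Hypothetical read-only adapter that exposes a .NET Stream as a COM IStream.
sealed class ManagedIStream : IStream
{
    private readonly Stream m_stream;

    public ManagedIStream(Stream stream)
    {
        if (stream == null)
            throw new ArgumentNullException("stream");
        m_stream = stream;
    }

    public void Read(byte[] pv, int cb, IntPtr pcbRead)
    {
        // Stream.Read may return fewer bytes than requested, so loop until
        // cb bytes have been read or the end of the stream is reached.
        int total = 0;
        while (total < cb)
        {
            int read = m_stream.Read(pv, total, cb - total);
            if (read == 0)
                break;
            total += read;
        }
        if (pcbRead != IntPtr.Zero)
            Marshal.WriteInt32(pcbRead, total);
    }

    public void Seek(long dlibMove, int dwOrigin, IntPtr plibNewPosition)
    {
        // STREAM_SEEK_SET/CUR/END (0/1/2) have the same values as SeekOrigin.
        long position = m_stream.Seek(dlibMove, (SeekOrigin) dwOrigin);
        if (plibNewPosition != IntPtr.Zero)
            Marshal.WriteInt64(plibNewPosition, position);
    }

    public void Stat(out STATSTG pstatstg, int grfStatFlag)
    {
        pstatstg = new STATSTG();
        pstatstg.type = 2; // STGTY_STREAM
        pstatstg.cbSize = m_stream.Length;
    }

    // The remaining members aren't needed for read-only access in this sketch.
    public void Write(byte[] pv, int cb, IntPtr pcbWritten) { throw new NotSupportedException(); }
    public void SetSize(long libNewSize) { throw new NotSupportedException(); }
    public void CopyTo(IStream pstm, long cb, IntPtr pcbRead, IntPtr pcbWritten) { throw new NotSupportedException(); }
    public void Commit(int grfCommitFlags) { throw new NotSupportedException(); }
    public void Revert() { throw new NotSupportedException(); }
    public void LockRegion(long libOffset, long cb, int dwLockType) { throw new NotSupportedException(); }
    public void UnlockRegion(long libOffset, long cb, int dwLockType) { throw new NotSupportedException(); }
    public void Clone(out IStream ppstm) { throw new NotSupportedException(); }
}
```

The wrapped stream must be seekable for Seek and Stat to work; whether MLang actually calls those members is not something this sketch verifies.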

// get stream

Stream stream = /* ... */;

// wrap input stream with an IStream

ManagedIStream istream = new ManagedIStream(stream);

// create MLang object

IMultiLanguage2 multiLanguage = (IMultiLanguage2) new MultiLanguage();

// allocate a number of DetectEncodingInfo structures for MLang to fill in

DetectEncodingInfo[] infos = new DetectEncodingInfo[8];
int infoCount = infos.Length;

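// detect the code page; on input, infoCount is the capacity of the array,
// and on output, it's the number of structures that MLang filled in

multiLanguage.DetectCodepageInIStream(MultiLanguageDetectCodePage.None, 0, istream, ref infos[0], ref infoCount);

// keep the wrapper alive until the COM call has returned
GC.KeepAlive(istream);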

// take the best code page that was found

int nCodePage = (int) infos.Take(infoCount).OrderByDescending(i => i.nConfidence).Select(i => i.nCodePage).FirstOrDefault();
return Encoding.GetEncoding(nCodePage);
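
One thing to watch for in the last step: if MLang reports no candidates, FirstOrDefault yields 0, and Encoding.GetEncoding(0) quietly returns the system default encoding. The sketch below makes that fallback explicit. The ChooseEncoding helper and its fallback parameter are hypothetical, and the struct is repeated from above only so that the example compiles on its own.

```csharp
using System;
using System.Linq;
using System.Runtime.InteropServices;
using System.Text;

[StructLayout(LayoutKind.Sequential)]
struct DetectEncodingInfo
{
    public uint nLangID;
    public uint nCodePage;
    public int nDocPercent;
    public int nConfidence;
}

// Hypothetical helper (not part of the original post's source code).
static class DetectionResult
{
    // Returns the encoding for the highest-confidence candidate, or the
    // caller's fallback if there are no candidates or the code page is unsupported.
    public static Encoding ChooseEncoding(DetectEncodingInfo[] infos, int infoCount, Encoding fallback)
    {
        if (infoCount <= 0)
            return fallback;

        int codePage = (int) infos.Take(infoCount).OrderByDescending(i => i.nConfidence).First().nCodePage;
        try
        {
            return Encoding.GetEncoding(codePage);
        }
        catch (ArgumentException)
        {
            return fallback;
        }
        catch (NotSupportedException)
        {
            return fallback;
        }
    }
}
```

A caller might pass Encoding.UTF8 as the fallback and then hand the result to a StreamReader, after rewinding the stream to its start.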

Full source code for this post (with additional error handling) is available in StreamUtility.cs, NativeMethods.cs, and ManagedIStream.cs.

Posted by Bradley Grainger on May 13, 2010