On U.S. English Windows, you can usually assume that a text file is encoded as
either UTF-8 or Windows-1252, but if you guess wrong, you might get text that
looks like this:
�It�s mine,� he said.
or worse yet, this:
â€œItâ€™s mine,â€ he said.
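To see where the two kinds of garbling come from, here is a minimal sketch in Python (chosen only for brevity; it is not part of the original article's code). It assumes the intended sentence used curly quotes, and it decodes each encoding's bytes with the wrong codec. The variable names are illustrative.

```python
# The sentence as intended, with curly quotes (assumed for this illustration).
original = "\u201cIt\u2019s mine,\u201d he said."

# Case 1: the bytes are Windows-1252, but we decode them as UTF-8.
# 0x93, 0x92 and 0x94 are not valid UTF-8 on their own, so each one
# is replaced with U+FFFD, the replacement character.
as_utf8 = original.encode("cp1252").decode("utf-8", errors="replace")

# Case 2: the bytes are UTF-8, but we decode them as Windows-1252.
# Each curly quote is three bytes in UTF-8, so it becomes up to three
# unrelated characters. Byte 0x9D is unassigned in Python's cp1252 codec,
# hence errors="replace" here as well.
as_cp1252 = original.encode("utf-8").decode("cp1252", errors="replace")

print(as_utf8)
print(as_cp1252)
```

The first case loses information (every quote collapses to the same replacement character), while the second keeps the bytes recoverable but is much harder to read.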
While it’s obviously best to know the encoding that’s used by the input you’re
processing, sometimes there’s no way to know it ahead of time. In that case,
there are libraries that can guess the encoding, usually based on statistical
analysis of the bytes or detection of invalid byte sequences. The Mozilla
project has a universal charset detector, and since IE5 Microsoft has shipped
MLang, a COM component that provides code page detection through the
IMultiLanguage2.DetectCodepageInIStream method.
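As a rough illustration of the "invalid byte sequences" approach, the sketch below (in Python, for brevity) checks for a UTF-8 byte order mark, then tries a strict UTF-8 decode, and falls back to Windows-1252 if that fails. This is not the algorithm used by Mozilla's detector or by MLang; `guess_encoding` is a hypothetical helper, and it assumes the only two candidates are UTF-8 and Windows-1252.

```python
import codecs


def guess_encoding(data: bytes) -> str:
    """Guess between UTF-8 and Windows-1252 (hypothetical helper)."""
    # A UTF-8 byte order mark is an unambiguous signal.
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"
    try:
        # Strict decoding raises on any invalid UTF-8 byte sequence.
        data.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        # Windows-1252 text with non-ASCII characters is rarely valid UTF-8,
        # so treat a decoding failure as evidence of the legacy code page.
        return "cp1252"


sentence = "\u201cIt\u2019s mine,\u201d he said."
print(guess_encoding(sentence.encode("utf-8")))
print(guess_encoding(sentence.encode("cp1252")))
```

Pure ASCII input is reported as UTF-8, which is harmless because both encodings agree on those bytes. A heuristic like this cannot tell Windows-1252 apart from other single-byte code pages, which is where statistical detectors earn their keep.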
The COM interfaces and structures we need are declared as follows (definitions
taken from MLang.h in the Windows SDK):
With the addition of a helper class to expose a .NET Stream as a COM IStream,
we can call MLang as follows: