I'm using the Japanese versions of Microsoft Windows and Office.
(I also privately maintain forks of some PST reading libraries, such as pst-extractor (Node.js) and ruby-msg-nx (Ruby). They are no longer actively updated by their original authors.)
Although this ANSI encoding issue may be a rare case, I would like ideas for a workaround.
nonUnicodeCP932.zip
With XstReader.App
With Outlook 2013 Japanese version
As the screenshots show, some texts are encoded in CP932.
This is an ANSI encoding, which can be obtained with Encoding.Default.
Encoding.Default depends on the Windows environment.
Thus a non-Japanese Windows will return an encoding other than CP932.
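To illustrate the environment dependence, here is a minimal Python sketch (Python is used only for illustration; it is not part of XstReader). It assumes the reading machine is a non-Japanese Windows whose ANSI code page is CP1252, and the sample string is made up.

```python
# Sample text, stored the way a Japanese Windows would store an ANSI string.
text = "日本語"
ansi_bytes = text.encode("cp932")

# Decoding with the code page the bytes were written in recovers the text.
decoded_ok = ansi_bytes.decode("cp932")

# Assumption: the reading machine's ANSI code page is CP1252 (Western European).
# The same bytes do not decode back to the original text (mojibake).
decoded_wrong = ansi_bytes.decode("cp1252", errors="replace")

print(decoded_ok == text)     # True
print(decoded_wrong == text)  # False
```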
PidTagInternetCodepage looks very useful for deciding the code page of ANSI texts.
But this is not a good idea.
As the posted sample nonUnicodeCP932.zip shows, some texts get messed up.
Why is PidTagInternetCodepage set to CP50220 (iso-2022-jp)?
In conclusion, PidTagInternetCodepage records how the mail was encoded when Outlook received it. This is useful only when we write a Reply or Forward: the reply/forward mail is sent with the same encoding that the sender's mail client used.
Outlook then derives many properties from the received mail, for example Subject.
When a value is converted to a property, a choice has to be made: ANSI (CP932 or similar) or Unicode.
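One possible workaround, sketched in Python under stated assumptions, is to choose the decoder from the property type rather than from PidTagInternetCodepage. The property type values 0x001E (8-bit string) and 0x001F (Unicode string) are the standard MAPI ones. The function name decode_string_property and its ansi_codec parameter are hypothetical, and the caller is assumed to know (or let the user select) the ANSI code page of the machine that wrote the file.

```python
PT_STRING8 = 0x001E  # 8-bit (ANSI) string property
PT_UNICODE = 0x001F  # UTF-16LE string property


def decode_string_property(prop_type: int, raw: bytes, ansi_codec: str = "cp932") -> str:
    """Decode a string property value (hypothetical helper).

    ansi_codec is an assumption supplied by the caller (for example a user
    setting); it is deliberately not taken from PidTagInternetCodepage.
    """
    if prop_type == PT_UNICODE:
        return raw.decode("utf-16-le")
    if prop_type == PT_STRING8:
        return raw.decode(ansi_codec, errors="replace")
    raise ValueError(f"not a string property type: {prop_type:#06x}")


subject = "日本語の件名"  # made-up sample subject
ansi_value = subject.encode("cp932")

# Decoding with the ANSI code page recovers the subject.
good = decode_string_property(PT_STRING8, ansi_value, "cp932")

# Decoding the same bytes as iso-2022-jp (what a PidTagInternetCodepage of
# 50220 would suggest) does not recover it.
bad = decode_string_property(PT_STRING8, ansi_value, "iso2022_jp")

print(good == subject, bad == subject)
```

The default of cp932 is only a placeholder for this example; a real reader would more likely expose the ANSI code page as a setting.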
iso-2022-jp is a 7-bit encoding.
Shift-JIS (CP932) is an 8-bit encoding that uses bytes in the range 0x80 ~ 0xFF. The default system locale of Japanese Windows uses CP932.
Some older SMTP server implementations were known to have compatibility issues with 8-bit data.
Thus iso-2022-jp was preferred by many Japanese mail clients, including Outlook.
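The 7-bit versus 8-bit difference described above can be checked directly. This is a small Python illustration with a made-up sample string; Python's iso2022_jp and cp932 codecs stand in for the Windows code pages.

```python
text = "日本語"

jis_bytes = text.encode("iso2022_jp")  # escape sequences plus 7-bit bytes
sjis_bytes = text.encode("cp932")      # two bytes per kanji, high bit used

# Every iso-2022-jp byte fits in 7 bits.
jis_is_7bit = all(b < 0x80 for b in jis_bytes)

# CP932 uses bytes in the 0x80-0xFF range for these characters.
sjis_has_8bit = any(b >= 0x80 for b in sjis_bytes)

print(jis_is_7bit, sjis_has_8bit)  # True True
```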