That's due to a naive implementation of read_prop<std::wstring> in object.h (line 329).
Basically, it goes like this:
HTML is typically transferred in an 8bit character encoding (ANSI, UTF-8, or some other 7 or 8bit compatible encoding). For typical markup this means one byte per character. When the property gets stored in the PST, it's stored as PT_BINARY... instead of PT_STRING8.
This is unfortunate, because the caller who is reading the stored property has no way of knowing if that binary blob is 8bit character data or 16bit character data.
The implementation of read_prop looks at the type specified for the property, and if it's PT_STRING8 (aka 001E), it correctly treats the data as an ANSI (8bit) string, converts that to a wchar string (16bit), and returns the wchar string. If it's anything else,
it simply casts the bytes to wchar via bytes_to_wstring. Unfortunately, since the data was originally stored in 8bit ANSI format, this produces garbage: each pair of 8bit characters is now interpreted as a single 16bit character.
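To make the difference concrete, here is a minimal standalone sketch of the two code paths. It is not pstsdk code: widen_bytes and reinterpret_bytes are hypothetical names, the byte typedef is a stand-in for pstsdk's own, and char16_t is used instead of wchar_t so the 16bit behaviour is the same on every platform.

```cpp
#include <cstring>
#include <string>
#include <vector>

typedef unsigned char byte;  // stand-in for pstsdk's byte type (assumption)

// The PT_STRING8 path: each 8bit character becomes one 16bit code unit.
// This is only a faithful conversion for ASCII/Latin-1 data; other code
// pages would need a real conversion routine.
std::u16string widen_bytes(const std::vector<byte>& bytes)
{
    return std::u16string(bytes.begin(), bytes.end());
}

// The "anything else" path: the buffer is taken to already hold 16bit
// code units, so every pair of bytes is fused into a single character.
std::u16string reinterpret_bytes(const std::vector<byte>& bytes)
{
    std::u16string out(bytes.size() / sizeof(char16_t), u'\0');
    if (!out.empty())
        std::memcpy(&out[0], &bytes[0], out.size() * sizeof(char16_t));
    return out;
}
```

For the six bytes of "<html>", widen_bytes gives six code units that still spell "<html>", while reinterpret_bytes gives only three, the first being 0x683C on a little-endian machine ('<' is 0x3C, 'h' is 0x68). That is the garbage you are seeing.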
In the .NET wrapper we're working on, we compensated for this by just assuming that the HTML body will always be in 8bit format... as it really doesn't make sense to store it any other way. Our current implementation is a bit of a bad hack, and actually
takes the wchar returned from get_HtmlBody() and converts it back to an ANSI string... so it's not ideal. But I'm about to change that to just not use the get_HtmlBody() method, and call the get_value_variable directly to get a byte vector.. then cast to an
This does leave the potential that we may trash valid Unicode content, if that field is ever stored as 16bit char data (like many of the others are)... I'm not sure of a solid way to detect 16bit vs 8bit when faced with a binary blob of unknown char data.
I guess there are a lot of heuristics that could come close in a detection scheme, but they wouldn't be perfect all the time.
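As one example of such a heuristic, the sketch below guesses the width from the byte pattern alone. It is purely illustrative and not part of pstsdk: guess_char_width and the guessed_width enum are made-up names, and the rule it uses (a UTF-16LE byte order mark, or mostly zero high bytes because HTML markup is mostly ASCII) is an assumption that will misfire on some inputs, e.g. 16bit text that is mostly non-Latin.

```cpp
#include <cstddef>
#include <vector>

typedef unsigned char byte;  // stand-in for pstsdk's byte type (assumption)

enum guessed_width { looks_8bit, looks_16bit };

// Hypothetical heuristic: guess whether a blob holds 8bit or 16bit
// (little-endian) character data.
guessed_width guess_char_width(const std::vector<byte>& bytes)
{
    // An odd byte count cannot be a whole number of 16bit code units.
    if (bytes.size() < 2 || bytes.size() % 2 != 0)
        return looks_8bit;

    // A UTF-16LE byte order mark is a strong hint.
    if (bytes[0] == 0xFF && bytes[1] == 0xFE)
        return looks_16bit;

    // In UTF-16LE, every ASCII character has a zero byte at the odd offset.
    std::size_t zero_high = 0;
    for (std::size_t i = 1; i < bytes.size(); i += 2)
        if (bytes[i] == 0)
            ++zero_high;

    // Call it 16bit if more than half of the code units look like ASCII.
    std::size_t units = bytes.size() / 2;
    return (zero_high * 2 > units) ? looks_16bit : looks_8bit;
}
```

A caller could use the result to choose between widening the bytes one at a time and handing the buffer to bytes_to_wstring, while still treating 8bit as the default for the HTML body.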
For reference, here's the implementation of read_prop that get_HtmlBody uses (from object.h)...
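template<>
inline std::wstring const_property_object::read_prop<std::wstring>(prop_id id) const
{
    std::vector<byte> buffer = get_value_variable(id);
    if(get_prop_type(id) == prop_type_string)
    {
        // PT_STRING8: widen each 8bit character to one wchar_t
        std::string s(buffer.begin(), buffer.end());
        return std::wstring(s.begin(), s.end());
    }
    // anything else (including PT_BINARY): reinterpret the raw bytes as wchar_t
    return bytes_to_wstring(buffer);
}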
and bytes_to_wstring (util.h : line 221):
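inline std::wstring pstsdk::bytes_to_wstring(const std::vector<byte> &bytes)
{
    if(bytes.size() == 0)
        return std::wstring();
    // point at the first element of the buffer, not at the vector object itself
    return std::wstring(reinterpret_cast<const wchar_t *>(&bytes[0]), bytes.size()/sizeof(wchar_t));
}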