Notes from using PSTSDK

Jun 22, 2010 at 11:58 PM

Yesterday, I wrote a PST-dumping script using pstsdk. There's a lot of interesting data hiding in PST files! And using pstsdk, it's really easy to access. You've definitely earned my thanks many times over. :-)

Here are a few notes from my work:

1) pst::message could use some more accessors, including 'From', 'Sender', 'Date', date received, and the RFC822 email headers (if present). I could easily throw a patch together for you tomorrow, if you're interested (and if I don't spend the day chasing server issues—c'est la vie). It seems like 'From' and 'Sender' are similar to, but not quite identical to, the pst::recipient class.
2) As mentioned in the other thread, using the pst::message API requires writing a lot of 'catch (key_not_found<...> ...)' blocks, because almost every accessor may raise that error. I'm not sure I have a better solution, but if you like, I could experiment with boost::optional and see what the resulting code looks like. Or maybe it would be better to define functions like 'has_subject'? I'm not really sure of the best approach here. But if you have any strong preferences, I'm happy to implement 'em.
3) How do ANSI-format PST files handle non-1252 code pages? Can I just check PidTagMessageCodepage and assume that all 8-bit strings are in that code page? Or do I need to find the right MS-OXO* PDF and start reading? :-) The following note in [MS-OXOPROPS].pdf looks promising:
Canonical name: PidTagMessageCodepage
Description: Specifies the code page used to encode the non-Unicode string properties on this Message object.
As usual, I'm happy to send you a patch if there's any particular behavior you'd prefer.
As always, many thanks for all your help, and thank you for such an excellent library!

Cheers,
Eric

Coordinator
Jun 23, 2010 at 12:17 AM

My thoughts:

1) For From & Sender, do you consider iterating over the recipient collection insufficient, or do you think there should be a helper for getting a copy of the common recipients? From and Sender should be identical to the recipient object: internally there is a table with all the recipients, and some properties from that table (sender name, etc.) are promoted to the message object at save time. As for the received date, a side note: I am very unhappy with the date representation currently in pstsdk. I'm converting everything back and forth from time_t, but internally in the PST, dates are generally stored as FILETIMEs. I wish I knew what the "preferred" way to do this was. If you look in object.h, you'll see that time_t gave me some headaches on gcc.

2) I'm strongly considering a has_XXX() method for all accessor methods which may be optional, since the previous discussion. I hadn't considered boost::optional - I've been trying to avoid using boost in external facing parts of the library though.

3) The PST cares very little about codepages (or the content/format of non-unicode strings, or unicode strings for that matter). It just happily stores char arrays given to it and returns them when asked. I imagine things would look very confusing if you took a PST file created on a computer in one locale and moved it to another. I don't see any conversion happening. PidTagMessageCodepage also isn't enforced or set by the PST proper (mspst32.dll); that's just a convention where MAPI clients stamp what codepage they were using on the PST. The PST knows nothing of it (and a quick scan of the Outlook source code makes me suspect this is generally just used/set by the OAB). It's entirely possible that Outlook might do the conversions necessary to make these scenarios work.

In general, I am not opposed to adding any number of accessor methods to the message class, folder, etc., as long as they seem "out of place" by not being present. I do want to avoid these classes becoming bloated with hundreds (or even dozens) of properties, though.
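
On the date representation in 1): a FILETIME counts 100-nanosecond intervals since 1601-01-01 UTC, and the Unix epoch begins 11,644,473,600 seconds after that. Below is a minimal standalone sketch of the conversion. The function and constant names are hypothetical and are not pstsdk's API, and the sketch uses a fixed-width 64-bit integer rather than time_t, on the assumption that this avoids depending on how a given compiler defines time_t.

```cpp
#include <cstdint>

// Seconds between 1601-01-01 (FILETIME epoch) and 1970-01-01 (Unix epoch), UTC.
constexpr std::int64_t kEpochDifferenceSeconds = 11644473600LL;
// A FILETIME tick is 100 nanoseconds, so there are 10^7 ticks per second.
constexpr std::int64_t kTicksPerSecond = 10000000LL;

// Hypothetical helper: FILETIME ticks -> whole seconds since the Unix epoch.
// Sub-second precision is truncated; dates before 1970 come out negative.
inline std::int64_t filetime_ticks_to_unix_seconds(std::uint64_t ticks) {
    const std::int64_t seconds_since_1601 =
        static_cast<std::int64_t>(ticks / static_cast<std::uint64_t>(kTicksPerSecond));
    return seconds_since_1601 - kEpochDifferenceSeconds;
}

// Hypothetical helper: seconds since the Unix epoch -> FILETIME ticks.
// Assumes the date is not earlier than 1601-01-01, which FILETIME cannot express.
inline std::uint64_t unix_seconds_to_filetime_ticks(std::int64_t unix_seconds) {
    return static_cast<std::uint64_t>(unix_seconds + kEpochDifferenceSeconds) *
           static_cast<std::uint64_t>(kTicksPerSecond);
}
```

Keeping the raw 64-bit tick count and converting only at the API boundary would also preserve the sub-second precision that a seconds-based type drops.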

Aug 3, 2010 at 10:21 PM

1) For From & Sender, do you consider iterating over the recipient collection insufficient, or do you think there should be helper for getting a copy of the common recipients?

Interesting. How does this work? When I iterate over the recipient collection, I see entries for mapi_to, mapi_cc, and mapi_bcc, but nothing else. Do I need to do something special to get access to From and Sender?

2) I'm strongly considering a has_XXX() method for all accessor methods which may be optional, since the previous discussion.

This would be a reasonable approach. We've already written the helper methods we need for this, so it's mostly a moot point for us. But I agree that this is probably worth addressing anyway.

3) The PST cares very little about codepages (or the content/format of non-unicode strings, or unicode strings for that matter). It just happily stores char arrays given to it and returns them when asked.

Ouch. So if we encounter any non-1252 ANSI PSTs, I guess we'll just run character set detection on them, and hope that we get a plausible answer. :-)

Many thanks for answering my questions, as always!

Cheers, Eric

Dec 14, 2010 at 1:46 AM

Eric, did you manage to figure out the answer to your "Interesting. How does this work?" part?

Can you share the helper functions you refer to in 2) above?
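
Until those are shared, here is a minimal sketch of one possible shape for such a helper: a small wrapper that turns a throwing accessor into an optional value, so callers do not need a catch block around every call. Everything in it is an assumption made for illustration. demo_key_not_found and demo_message are stand-ins rather than pstsdk types, try_get is a hypothetical name, and std::optional (C++17) is used in place of the boost::optional mentioned earlier in the thread.

```cpp
#include <exception>
#include <map>
#include <optional>
#include <string>
#include <utility>

// Stand-in for the library's key_not_found<...> exception (illustration only).
struct demo_key_not_found : std::exception {
    const char* what() const noexcept override { return "key not found"; }
};

// Stand-in for a message whose accessors throw when a property is absent.
class demo_message {
public:
    explicit demo_message(std::map<std::string, std::wstring> props)
        : props_(std::move(props)) {}

    std::wstring get_subject() const { return read("subject"); }
    std::wstring get_body() const { return read("body"); }

private:
    std::wstring read(const std::string& key) const {
        auto it = props_.find(key);
        if (it == props_.end()) throw demo_key_not_found();
        return it->second;
    }

    std::map<std::string, std::wstring> props_;
};

// Hypothetical helper: call the accessor and return its value, or an empty
// optional if the property is missing. Other exceptions still propagate.
template <typename Getter>
auto try_get(Getter getter) -> std::optional<decltype(getter())> {
    try {
        return getter();
    } catch (const demo_key_not_found&) {
        return std::nullopt;
    }
}
```

A call would look like try_get([&] { return msg.get_subject(); }), followed by a check of has_value(). With pstsdk itself, the catch clause would name the library's key_not_found<...> exception instead of the stand-in.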

Mar 24, 2011 at 8:49 PM

I was able to get the sender email address by using the following snippet:

std::wstring senderEmail = message.get_property_bag().read_prop<std::wstring>(0x0C1F);

The value I used (0x0C1F) is the property ID of the MAPI tag PR_SENDER_EMAIL_ADDRESS_W. To get the sender name (PR_SENDER_NAME_W), use property ID 0x0C1A.
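
For context on how the 16-bit value above relates to the full MAPI property tag: a tag is 32 bits, with the property ID in the high 16 bits and the property type in the low 16 bits. PR_SENDER_EMAIL_ADDRESS_W is therefore 0x0C1F001F, where 0x001F is PT_UNICODE. The sketch below just shows that packing; the helper and constant names are hypothetical and are not part of pstsdk or the MAPI headers.

```cpp
#include <cstdint>

// MAPI property types for 8-bit and Unicode strings.
constexpr std::uint16_t kPtString8 = 0x001E;
constexpr std::uint16_t kPtUnicode = 0x001F;

// Property IDs mentioned in this thread.
constexpr std::uint16_t kSenderNameId = 0x0C1A;
constexpr std::uint16_t kSenderEmailAddressId = 0x0C1F;

// Hypothetical helper: pack an ID and a type into a 32-bit property tag.
constexpr std::uint32_t make_prop_tag(std::uint16_t prop_id, std::uint16_t prop_type) {
    return (static_cast<std::uint32_t>(prop_id) << 16) | prop_type;
}

// Hypothetical helpers: split a 32-bit property tag back into its parts.
constexpr std::uint16_t prop_id_of(std::uint32_t tag) {
    return static_cast<std::uint16_t>(tag >> 16);
}
constexpr std::uint16_t prop_type_of(std::uint32_t tag) {
    return static_cast<std::uint16_t>(tag & 0xFFFFu);
}
```

In an ANSI PST the same property ID would be expected to carry the 8-bit string type (0x001E) rather than 0x001F, which is where the codepage question from earlier in the thread comes back in.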