Two draft patches: Mac build, sign-extension

Aug 4, 2010 at 3:41 PM

Good morning! Here are two draft patches. These still need to be tested on multiple platforms, but I thought it would be good to ask for feedback now instead of later. :-) Once you've had a chance to suggest corrections, I'll send you a single big patch by email.

Fix alignment of bth_nonleaf_entry under 32-bit Mac GCC

This is a fairly simple build patch for 32-bit MacOS X. We'll test this on 32-bit and 64-bit Linux before I officially submit it.

Convert ANSI string to Unicode wstring by assuming code page 1252

This is a much more dubious and debatable patch. :-) Our application calls 'props->read_prop<wstring>' quite often, and this works fine for Unicode PSTs. It also works for ANSI PSTs that contain ASCII text. But at least on my version of GCC, it fails for all characters between 0x80 and 0xFF, inclusive. Specifically, a character like 0xA7 gets sign-extended to 0xFFA7, instead of being converted to 0x00A7.

Since ANSI PSTs apparently don't contain reliable code page information, there's no way to make 'props->read_prop<wstring>' do the right thing automatically. But the current code appears to try to support ANSI PSTs using code page 1252, which are the most common ANSI PSTs. So my patch adds an explicit conversion function and some unit tests. It also tries to do the right thing with the tricky characters between 0x80 and 0x9F, inclusive, using the conversion tables Microsoft submitted to the Unicode consortium. Using this patch, it's possible to access code page 1252 ANSI PSTs using 'props->read_prop<wstring>'. Other code pages will still require calling 'props->read_prop<string>' and converting strings using an external library.

Obviously, this patch is still neither complete nor correct, but it might arguably be a step in the right direction. Please feel free to tell me that I'm nuts for even suggesting it. :-)

As always, thank you for such an excellent library, and for all your feedback!

Cheers, Eric

Aug 4, 2010 at 6:17 PM

The alignment change looks good. Could you add a comment to heap.h to explain why the intermediate variable is necessary?

I actually don't have any objections to the approach the second patch takes as well. I have no useful feedback, other than maybe the "specials" array should be const. The string to wstring conversions were just for convenience. I seriously considered just not allowing it all together (and throwing a bad_cast or something) to sidestep these issues, but eventually I decided doing the naive conversion would be better than nothing. I didn't consider sign extension, obviously. Since I've already gone down the "do something" road, any change which makes the naive conversion slightly less naive is a good thing.

As always, thank you for your contributions!

Aug 8, 2010 at 5:39 PM
Edited Aug 8, 2010 at 6:34 PM

No problem! I've made the updates you suggested, and I'm preparing a patch series for you here:

Once the PR_ENTRYID patch is ready, and I've tested everything on Windows and Linux, I'll submit a single, unified patch via email.

Cheers, Eric