Mac patches & unit test debugging question

Jun 16, 2010 at 6:06 PM

Greetings! I've fixed an assertion failure and a unit test failure on the Mac: (patches 6154, 6155)

You can also find my latest set of patches (with build instructions) on github: . Note that is just a scratch repo while I'm working on the pstsdk port, and it may go away in the future.

The unit tests currently fail on databasetest.cpp:242:


The program exits with the following message:


terminate called after throwing an instance of 'pstsdk::key_not_found<unsigned int>'

what():  key not found

key not found

Program received signal SIGABRT, Aborted.

Here's a GDB stack trace:

#0  0x00238ad5 in __cxa_throw ()

#1  0x0002d1af in pstsdk::btree_node_leaf<unsigned int, pstsdk::node_info>::lookup (this=0x400290, k=@0xbffff068) at btree.h:297

#2  0x00019be0 in pstsdk::database_impl<unsigned long long>::lookup_node_info (this=0x804a00, nid=33) at database.h:541

#3  0x00019b1d in pstsdk::database_impl<unsigned long long>::lookup_node (this=0x804a00, nid=33) at database.h:79

#4  0x00006836 in test_db () at /Users/emk/src/aranetic/pstsdk/test/databasetest.cpp:241

#5  0x00055b12 in main () at /Users/emk/src/aranetic/pstsdk/test/main.cpp:20


I'm not really sure how to debug this from here. I suppose that I might try building pstsdk on Windows, and stepping through it in a debugger to see where the Mac behavior diverges. Any other suggestions or advice would be greatly appreciated.

Thank you for you help and advice!

Cheers, Eric


Jun 17, 2010 at 1:32 AM

I wrote a reply to this earlier this morning - it looks like Codeplex dropped it on the floor.

Anyway, I think the root cause of this issue is related to having the header structures incorrectly aligned. As I commented on one of the patches, the offset for the root_info structure was incorrect in your patch, so maybe that explains it. I know one thing for sure - database_impl::lookup_node_info shouldn't be calling directly into btree_node_leaf, it should be calling into a btree_node_nonleaf. So something is going wrong there, and I think it's reading the incorrect page off of disk. However I can't explain how it managed to get this far; if it read an incorrect address, the CRC should fail etc at some point before here. So that is a mystery.

Jun 17, 2010 at 12:34 PM

Yeah, root_info does have an incorrect offset in my patch, which is weird, because I manually counted the bytes in doc/[MS-PST].pdf. In particular, see " HEADER", where there's an 8-byte spacer named qwUnused in the Unicode header, but only a 4-byte gap in your version of the structure.

Similarly, the latest [MS-PST].pdf says that root_info<ulongulong> should take up 72 bytes in the structure, but it takes up 80 in your code. And your code appears to work.

So, the moral of this story is not to trust [MS-PST].pdf, but to instead use your code to dump structure offsets. I'll keep working on getting the structures aligned correctly, using a copy of Visual C++ 10 Express.

Many thanks for helping me with this! I really appreciate your time and effort.

Cheers, Eric

Jun 17, 2010 at 1:46 PM

OK, I'm going through disk.h and adding more static_assert calls on the structures which are giving me trouble. I'll submit another patch with nothing but these static_assert calls, and then we can figure out how to get things correctly aligned on other platforms.

My guess is we won't need lots of #ifdefs. There's a couple of GCC knobs I can twiddle to force alignment once I actually understand what's going on. Now that I know not to entirely trust [MS-PST].pdf, this should be relatively easy. ;-)

Jun 17, 2010 at 2:16 PM

OK, I've submitted a (very rough) patch with more static assertions on all the tricky structures. Note that this patch appears to be broken on GCC (which doesn't like some of the assertions). I'll fix these problems and upload another patch.

Jun 17, 2010 at 2:56 PM

New patches, tested on Mac and Win32:

6162: Updated CMake patch which deletes test/makefile to prevent case-insensitive filename collision with CMake's generated test/Makefile.

6163: Updated static assertion patch which only calls offset_of on POD ("plain old data") classes, for compatibility with GCC.

Please feel free to ignore the older versions of these patches.

When you get in later today, would you like to get in touch via voice or IRC to figure out how to handle alignment in a cross-platform fashion? Thank you for all your help in porting pstsdk to other platforms!


Jun 17, 2010 at 3:46 PM

Ok, you've given me a lot to think about and look at, thanks for all your work here. :) I'll take a closer look at the patches in a bit, but for now I looked into the [MS-PST] issues you pointed out and I think I see what's going on.

[MS-PST] has an 8 byte buffer before the root_info structure, and root_info has a 4 byte reserved value then then 8 byte ibFileEof value. After the root_info structure is a 4 byte alignment (in the header structure).

pstsdk has a 4 byte buffer (alignment) before the root_info structure.. The root structure has a 4 byte cOrphans member, followed by a 4 byte alignment member. An in pstsdk, the 4 byte alignment located after the root structure is actually a member of the root structure. 

So, basically, when all is said and done, all member variables of interest are located at the same absolute offsets in both [MS-PST] and pstsdk. Things get confusing because A) cOrphans member is present in pstsdk, where it's ignored as general padding/buffer in [MS-PST], and B) various paddings are attributed to different structures sometimes between the two references.

I'll make a note to update [MS-PST] to move around the padding to better reflect what pstsdk and the compiler do. If you see any other inconsistencies between pstsdk and [MS-PST], please point them out. Thanks again!

Jun 17, 2010 at 4:17 PM

Thank you for looking into these problems! I really appreciate all your help.

I've managed to get all the data structures correctly aligned under MacPorts GCC 4.4, simply by adding two (rather kludgy) __attribute((aligned(...))) declarations:

diff --git a/pstsdk/disk/disk.h b/pstsdk/disk/disk.h
index 591442a..1a9e09e 100644
--- a/pstsdk/disk/disk.h
+++ b/pstsdk/disk/disk.h
@@ -111,7 +111,7 @@ enum crypt_method
 template<typename T>
 struct root
-    typedef T location;
+    typedef T location __attribute((aligned(sizeof(T))));
     typedef T count;
     ulong cOrphans;             //!< The number of "orphans" in the BBT
@@ -562,7 +562,7 @@ struct page_trailer
     ushort signature;      //!< Signature of this page, as calculated by the \ref compute_signature function
     ulong crc;             //!< CRC of this page, as calculated by the \ref compute_crc function
     block_id_disk bid;     //!< The id of this page
+} __attribute((aligned(sizeof(ulonglong))));
 //! \cond static_asserts
 static_assert(sizeof(page_trailer) == 16, "page_trailer incorrect size");
 //! \endcond

These __attribute declarations are basically equivalent to aligning all ulonglong fields to an 8-byte boundary. Between them, they fix the alignment of root and page_trailer throughout the file (coincidentally fixing the other issues in page_header along the way). Obviously, this is a very kludgy solution, and we can presumably do better with a bit of thought.

However, these changes cause a new test failure:

bt->cb: 156 size: 156
bt->cb: 1024 size: 0
Assertion failed: (bt->cb == size), function test_block, file /Users/emk/src/aranetic/pstsdk/test/disktest.cpp, line 21.

This appears to be caused by something going wrong in align_disk and align_slot. I don't quite understand those two functions yet, so this may take me a little while to sort out. If you'd like to try for a real-time discussion at some point, please let me know--I'm working on pstsdk full-time for the next several days at least.

Once again, thank you for your help!

Cheers, Eric

Jun 17, 2010 at 4:39 PM

All space allocations in the PST are multiples of 64 bytes. The point of align_slot is to round up the passed size to the nearest 64 bytes. The point of align_disk is to take the data size of a block, add in the block trailer size, and align that value to the nearest slot. Another detail - block trailers are always aligned to the end of the allocation, so any padding added because of align_slot is between the "real" data (aligned to the start of the allocation) and the block trailer (aligned against the end of the allocation). That's why you see the weird calculation on line 16 of disktest.cpp.

I suspect there are still issues in structure alignment not caught by static_asserts based on that error. Perhaps we need to add more around block and block_trailer structures?

Jun 17, 2010 at 4:48 PM

Thank you for explaining align_disk and align_slot! If you'd like, I can add some unit tests for these functions.

I'll add some more alignment assertions this afternoon, and send you another patch. Now that I have a working Windows build, this isn't too hard.

Jun 17, 2010 at 6:36 PM

It looks like you're right! I've been reading through disk.h, adding static assertions, and finding all sorts of little structures which will wind up misaligned.

The only significant different between Windows and Mac here is the fact that Windows is aligning ulonglong to 8 bytes, and Mac is aligning ulongulong to 4 bytes. Now that I know what to look for, it's pretty easy to find the offenders by hand. This is apparently a well-known issue, with several possible workarounds.

To fix this, we need to decide which workaround adds the least crud to your nice clean code. :-) I'm thinking that __attribute__((ms_struct)) could be wrapped up in a macro and appended to each of the structure definitions, but maybe one of the available pragmas would be a better choice.

Any thoughts? I'm happy to do whatever works for you, and supply a patch that has been tested on several compilers. Thank you for your feedback!

Jun 17, 2010 at 7:17 PM

OK, '#pragma ms_struct on' seems to work fairly well, and I'll be preparing an appropriate patch.

The tests are now failing in highlevel.cpp:

Green Category
Red Category
terminate called after throwing an instance of 'pstsdk::key_not_found'
  what():  key not found

Here's a stack trace:

#0  0x00238ad5 in __cxa_throw ()
#1  0x00038d05 in pstsdk::name_id_map::lookup (this=0xbfffebec, p=@0xbfffef64) at nameid.h:323
#2  0x00030688 in test_nameid_map_samp1 (pdb={> = {_M_ptr = 0x804200, _M_refcount = {_M_pi = 0x4004c0}}, }) at /Users/emk/src/aranetic/pstsdk/test/highlevel.cpp:22
#3  0x00033335 in test_highlevel () at /Users/emk/src/aranetic/pstsdk/test/highlevel.cpp:303
#4  0x00055b29 in main () at /Users/emk/src/aranetic/pstsdk/test/main.cpp:22

This looks more alignment problems somewhere. Do you know which other headers contain low-level structures that might need '#pragma ms_struct on'? I'll keep working, and let you know what I find.

Jun 17, 2010 at 7:28 PM

I've added '#pragma ms_struct on' to database_iface.h, which has the only other suspicious 'struct' definitions. Unfortunately, the error is still occurring. Will keep digging...

Jun 17, 2010 at 9:12 PM

util\primitives.h is the only other file I can think of that might be problematic. I actually made an effort to centralize anything which is alignment sensitive in disk.h, for this reason. Let me know if you figure out what structure is the problem and I'll probably end up moving it.

Jun 17, 2010 at 9:24 PM

I've temporarily wrapped everything in ms_struct, so I don't think it's an alignment problem (or at least not an obvious one).

However, on the Mac, I only get a few lines of test output before this failure, and one Windows, I get many pages of output by the time I reach this point. It seems like the various 'iterate(...)' calls in highlevel.cpp are only looking at a small amount of the PST on the Mac. And the following code in name_id_map::lookup seems to suggest that there's a problem:

    prop_stream bucket(const_cast<name_id_map*>(this)->m_bag.open_prop_stream(get_bucket_prop(hash_value)));
    disk::nameid_hash_entry entry;
    while(*)&entry, sizeof(entry)) != 0)
        if( (entry.hash_base == hash_base) &&
            (disk::nameid_is_string(entry) == p.is_string()) &&
            (disk::nameid_get_guid_index(entry) == guid_index)
            // just double check the string..
                if(construct(disk::nameid_get_prop_index(entry)).get_name() != p.get_name())

            // found it!
            return disk::nameid_get_prop_index(entry) + 0x8000;

    throw key_not_found<named_prop>(p);

On the Mac, we only make one trip through this loop, and the comparison of entry.hash_base == hash_base fails.

Thank you for any advice you can provide!

Jun 18, 2010 at 2:17 AM

Found it, in nameid.h:

    ulong compute_hash_base(const named_prop& n) const 
        { return n.is_string() ? disk::compute_crc(&(n.get_name())[0], n.get_name().length() * sizeof(wchar_t)) : n.get_id(); }

The problem here is wchar_t, which is 2 bytes on Windows, and 4 bytes on the Mac (and most newer Unix-like systems, including Linux). Normally, this wouldn't matter, but here, the CRC function is accessing the raw bytes of a std::wstring.

Fixing this shouldn't be too hard, and I can grep for other occurrences of sizeof(wchar_t) and inspect them manually. I'm just glad that all our systems are little-endian, so I don't have to do any byte-swapping!

Thank you for your hints! You've helped this port go quite quickly!

Cheers, Eric

Jun 18, 2010 at 2:32 AM

Interesting! So basically, the CRC calculation was off because the string was twice the length (even though the string had the correct value)? And I was trying to be all nice and portable by using sizeof(wchar_t), but to no avail..

Did you ever figure out why the output of the unit test was shorter on Mac? Is everything passing now?

Jun 19, 2010 at 2:16 AM

Current status:

  • Working: OS X 10.5 (under MacPorts)
  • Working: Ubuntu 10.04 32-bit
  • BROKEN: Ubuntu 10.04 64-bit

I'm going to debug a for a little bit more, then go to sleep.

Once it's all working, I need to retest on Windows and organize the patches for submission.

Jun 19, 2010 at 2:50 AM

OK, Ubuntu 10.04 64-bit took less than an hour to working using an EC2 system, at a total cost of $0.34. :-) I needed to change read_prop<long> to specify a 32-bit type (and not a 64-bit long), and that was it.

Each successive system is getting easier and easier, which is a good sign. I'll start preparing the patches tomorrow, and I'll try to submit them sometime this weekend if all goes well.

Jun 19, 2010 at 2:56 AM

Amazing progress. That reminds me of another item on my list: switch over to using boost types exclusively (uint_32, etc) to avoid this exact problem.