Writing large Unicode chars to streams fails?

Jan 1, 2011 at 12:28 AM
Edited Jan 1, 2011 at 12:28 AM

I have a message with a subject that contains two Unicode characters above 65000.

When I write these out to a disk file with

std::wofstream fCsvFile;		// Stream output: CSV file for import

...

	std::wstring CsvFname = g_OutPath + g_sPrefix + L".txt";
	fCsvFile.open(CsvFname.c_str(), std::ios::out);

...

	fCsvFile << nFileIndexNum << L"\t"	// ITEMID is Index number
		 << wSavedFile << L"\t"		// FILEPATHNAME is Full path to output file
		 << wStubFile << L"\t"		// FILENAME is output file name without path
		 << L"error\t"			// STATUS = Error to force MD5 recalculation
		 << wSubject << L"\t"		// Subject
		 << moddate << L"\t"		// MODIFYDATE
		 << wEid << L"\t"		// UserField1 is message EID
		 << g_PstIndexNum << L"\n";	// UserField2 is parent (PST) ID

writing to the output stream simply stops with the first Unicode character, and nothing else gets written to the output stream ever. But there is no error generated, so I can't tell what's going wrong, and there doesn't seem to be anything I can catch to tell me it broke.

This one I am sure is my fault somehow... does anyone have any suggestions as to what I am doing wrong here?

Coordinator
Jan 1, 2011 at 4:13 AM

I wouldn't be so sure it's your fault. Unicode in Windows is built around UCS-2, and when most of this was built UTF-16 wasn't officially around, or was very new. I wouldn't be surprised if someone along the path didn't handle these characters properly.

You might want to look at the string to see if it's a properly encoded UTF-16 string (for these two specific characters). I'm guessing it's not. Then it's a matter of finding out if it's encoded that way in the PST file, or got mangled somewhere along the way. If it is properly encoded, then follow up with your compiler vendor.
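
A rough sketch of what "properly encoded UTF-16" means in code: every high surrogate (0xD800-0xDBFF) must be followed by a low surrogate (0xDC00-0xDFFF), and a low surrogate must never appear on its own. The function name `isWellFormedUtf16` is hypothetical, and the sketch assumes each `wchar_t` holds one UTF-16 code unit, as it does on Windows.

```cpp
#include <cstddef>
#include <string>

// Hypothetical check: returns true if every surrogate in s is correctly paired.
// Assumes each wchar_t element carries one UTF-16 code unit.
bool isWellFormedUtf16(const std::wstring& s) {
    for (std::size_t i = 0; i < s.size(); ++i) {
        unsigned long unit = static_cast<unsigned long>(s[i]);
        if (unit >= 0xD800 && unit <= 0xDBFF) {         // high surrogate
            if (i + 1 >= s.size()) return false;        // nothing follows it
            unsigned long next = static_cast<unsigned long>(s[i + 1]);
            if (next < 0xDC00 || next > 0xDFFF) return false;
            ++i;                                        // skip the low surrogate
        } else if (unit >= 0xDC00 && unit <= 0xDFFF) {  // unpaired low surrogate
            return false;
        }
    }
    return true;
}
```

Note that code units outside the surrogate range, anything below 0xD800 or from 0xE000 up to 0xFFFF, stand for themselves in UTF-16, so a check like this passes them without further inspection.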

Jan 5, 2011 at 9:49 PM
Edited Jan 6, 2011 at 12:24 AM

The actual string is:

RE: Mr X\xFFC2\xFFB4s Situation

According to my understanding, those are valid, albeit odd, Unicode 1.0 symbols. My compiler vendor is Microsoft (VS2008). Are you aware of any related issues with streams in that compiler? And this is a very large (800 MB) file; can you suggest how I might check the PST file content? The PST file does load correctly in Outlook, though it doesn't seem to display those characters...

Edit: I have found at least one commentator who says that Microsoft Visual Studio 2005's implementation of wofstream does not actually handle UTF-16 / UCS-2; only characters 0-255 are handled correctly. It would seem that this could be a problem in VS2008 as well. I do note that if I attempt to write the standard byte order mark (BOM) 0xFEFF at the beginning of a stream opened with wofstream, nothing ever gets into that stream: the file is created, but ends up 0 length.

Coordinator
Jan 6, 2011 at 6:32 AM

Hmm. It looks like creating a wide stream only means that it accepts wide strings as input, and that the output encoding is still ANSI. The characters aren't convertible in your current locale, so it fails. You'll need to imbue the proper locale, or a utf8 facet as the locale.

See this question, which links to this boost library.
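
As a sketch of what imbuing a UTF-8 facet can look like: the version below uses `std::codecvt_utf8` from the C++11 `<codecvt>` header. That is an assumption on my part, since VS2008 predates it (there the Boost facet would take its place), and the header is deprecated as of C++17. The function name `writeUtf8Line` is hypothetical.

```cpp
#include <codecvt>
#include <fstream>
#include <locale>
#include <string>

// Hypothetical helper: writes one line of wide text to `path` as UTF-8.
// Returns false if the stream reports an error.
bool writeUtf8Line(const char* path, const std::wstring& line) {
    std::wofstream out;
    // Imbue before opening so every character goes through the UTF-8 facet.
    // The locale takes ownership of the facet allocated here.
    out.imbue(std::locale(std::locale::classic(),
                          new std::codecvt_utf8<wchar_t>));
    out.open(path, std::ios::out | std::ios::binary);
    if (!out) return false;
    out << line << L'\n';
    out.flush();
    return !out.fail();
}
```

With the facet in place, a character such as 0xFFC2 should come out as the three UTF-8 bytes EF BF 82 rather than stopping the stream.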

Unicode is really really complicated, and iostreams are rocket science.

Jan 6, 2011 at 9:14 PM

In fact, in this case I don't want ANSI out at all; I really want the original Unicode. For the moment, I'm assuming that the data in the PST file is UTF-16, and will use the library contents untranslated -- I have no assurance that the PST file I'm working with can be translated into the code page of the machine I'm running on.

The way I'm doing that is basically brain-dead:

std::ofstream fCsvFile;		// Stream output: CSV file for import

...

	fCsvFile.open(CsvFname.c_str(), std::ios::out | std::ios::binary);

	wchar_t wCsvLine[400];
	::StringCchPrintfW(wCsvLine, 400, L"%s", wSubject.c_str());
	fCsvFile.write( (char *)wCsvLine, ::wcslen(wCsvLine) * sizeof(wchar_t));

I know it breaks the platform independence, and I'm not happy about that; I could use sprintf there instead, by including cstdio, and I may do that, or the equivalent later on, particularly if I can find a platform-independent formatting routine that doesn't suffer sprintf's potential buffer overrun issue. All I can say for it is, it works.