I've working with a legacy application and I'm trying to work out the difference between applications compiled with Multi byte character set
and Not Set
under the Character Set
option.
I understand that compiling with Multi byte character set
defines _MBCS
which allows multi byte character set code pages to be used, and using Not set
doesn't define _MBCS
, in which case only single byte character set code pages are allowed.
In the case that Not Set
is used, I'm assuming then that we can only use the single byte character set code pages found on this page: http://msdn.microsoft.com/en-gb/goglobal/bb964654.aspx
Therefore, am I correct in thinking that is Not Set
is used, the application won't be able to encode and write or read far eastern languages since they are defined in double byte character set code pages (and of course Unicode)?
Following on from this, if Multi byte character
set is defined, are both single and multi byte character set code pages available, or only multi byte character set code pages? I'm guessing it must be both for European languages to be supported.
Thanks,
Andy
Further Reading
The answers on these pages didn't answer my question, but helped in my understanding: About the "Character set" option in visual studio 2010
Research
So, just as working research... With my locale set as Japanese
Effect on hard coded strings
char *foo = "Jap text: テスト";
wchar_t *bar = L"Jap text: テスト";
Compiling with Unicode
*foo = 4a 61 70 20 74 65 78 74 3a 20 83 65 83 58 83 67 == Shift-Jis (Code page 932)
*bar = 4a 00 61 00 70 00 20 00 74 00 65 00 78 00 74 00 3a 00 20 00 c6 30 b9 30 c8 30 == UTF-16 or UCS-2
Compiling with Multi byte character set
*foo = 4a 61 70 20 74 65 78 74 3a 20 83 65 83 58 83 67 == Shift-Jis (Code page 932)
*bar = 4a 00 61 00 70 00 20 00 74 00 65 00 78 00 74 00 3a 00 20 00 c6 30 b9 30 c8 30 == UTF-16 or UCS-2
Compiling with Not Set
*foo = 4a 61 70 20 74 65 78 74 3a 20 83 65 83 58 83 67 == Shift-Jis (Code page 932)
*bar = 4a 00 61 00 70 00 20 00 74 00 65 00 78 00 74 00 3a 00 20 00 c6 30 b9 30 c8 30 == UTF-16 or UCS-2
Conclusion: The character encoding doesn't have any effect on hard coded strings. Although defining chars as above seems to use the Locale defined codepage and wchar_t seems to use either UCS-2 or UTF-16.
Using encoded strings in W/A versions of Win32 APIs
So, using the following code:
char *foo = "C:\Temp\テスト\テa.txt";
wchar_t *bar = L"C:\Temp\テスト\テw.txt";
CreateFileA(bar, GENERIC_WRITE, 0, NULL, CREATE_ALWAYS, FILE_ATTRIBUTE_NORMAL, NULL);
CreateFileW(foo, GENERIC_WRITE, 0, NULL, CREATE_ALWAYS, FILE_ATTRIBUTE_NORMAL, NULL);
Compiling with Unicode
Result: Both files are created
Compiling with Multi byte character set
Result: Both files are created
Compiling with Not set
Result: Both files are created
Conclusion:
Both the A
and W
version of the API expect the same encoding regardless of the character set chosen. From this, perhaps we can assume that all the Character Set
option does is switch between the version of the API. So the A
version always expects strings in the encoding of the current code page and the W
version always expects UTF-16 or UCS-2.
Opening files using W and A Win32 APIs
So using the following code:
char filea[MAX_PATH] = {0};
OPENFILENAMEA ofna = {0};
ofna.lStructSize = sizeof ( ofna );
ofna.hwndOwner = NULL ;
ofna.lpstrFile = filea ;
ofna.nMaxFile = MAX_PATH;
ofna.lpstrFilter = "All*.*Text*.TXT";
ofna.nFilterIndex =1;
ofna.lpstrFileTitle = NULL ;
ofna.nMaxFileTitle = 0 ;
ofna.lpstrInitialDir=NULL ;
ofna.Flags = OFN_PATHMUSTEXIST|OFN_FILEMUSTEXIST ;
wchar_t filew[MAX_PATH] = {0};
OPENFILENAMEW ofnw = {0};
ofnw.lStructSize = sizeof ( ofnw );
ofnw.hwndOwner = NULL ;
ofnw.lpstrFile = filew ;
ofnw.nMaxFile = MAX_PATH;
ofnw.lpstrFilter = L"All*.*Text*.TXT";
ofnw.nFilterIndex =1;
ofnw.lpstrFileTitle = NULL;
ofnw.nMaxFileTitle = 0 ;
ofnw.lpstrInitialDir=NULL ;
ofnw.Flags = OFN_PATHMUSTEXIST|OFN_FILEMUSTEXIST ;
GetOpenFileNameA(&ofna);
GetOpenFileNameW(&ofnw);
and selecting either:
- C:Tempテストテopenw.txt
- C:Tempテストテopenw.txt
Yields:
When compiled with Unicode
*filea = 43 3a 5c 54 65 6d 70 5c 83 65 83 58 83 67 5c 83 65 6f 70 65 6e 61 2e 74 78 74 == Shift-Jis (Code page 932)
*filew = 43 00 3a 00 5c 00 54 00 65 00 6d 00 70 00 5c 00 c6 30 b9 30 c8 30 5c 00 c6 30 6f 00 70 00 65 00 6e 00 77 00 2e 00 74 00 78 00 74 00 == UTF-16 or UCS-2
When compiled with Multi byte character set
*filea = 43 3a 5c 54 65 6d 70 5c 83 65 83 58 83 67 5c 83 65 6f 70 65 6e 61 2e 74 78 74 == Shift-Jis (Code page 932)
*filew = 43 00 3a 00 5c 00 54 00 65 00 6d 00 70 00 5c 00 c6 30 b9 30 c8 30 5c 00 c6 30 6f 00 70 00 65 00 6e 00 77 00 2e 00 74 00 78 00 74 00 == UTF-16 or UCS-2
When compiled with Not Set
*filea = 43 3a 5c 54 65 6d 70 5c 83 65 83 58 83 67 5c 83 65 6f 70 65 6e 61 2e 74 78 74 == Shift-Jis (Code page 932)
*filew = 43 00 3a 00 5c 00 54 00 65 00 6d 00 70 00 5c 00 c6 30 b9 30 c8 30 5c 00 c6 30 6f 00 70 00 65 00 6e 00 77 00 2e 00 74 00 78 00 74 00 == UTF-16 or UCS-2
Conclusion:
Again, the Character Set
setting doesn't have a bearing on the behaviour of the Win32 API. The A
version always seems to return a string with the encoding of the active code page and the W
one always returns UTF-16 or UCS-2. I can actually see this explained a bit in this great answer: https://stackoverflow.com/a/3299860/187100.
Ultimate Conculsion
Hans appears to be correct when he says that the define doesn't really have any magic to it, beyond changing the Win32 APIs to use either W
or A
. Therefore, I can't really see any difference between Not Set
and Multi byte character set
.