The u8 prefix really just means "when compiling this code, generate a UTF-8 string from this literal". It says nothing about how the literal in the source file should be interpreted by the compiler.
So you have several factors at play:
- which encoding the source file is written in (in your case, apparently ISO-8859). According to this encoding, the string literal is "åäö" (3 bytes, containing the values 0xe5, 0xe4, 0xf6)
- which encoding the compiler assumes when reading the source file (I suspect that GCC defaults to UTF-8, but I could be wrong)
- the encoding that the compiler uses for the generated string in the object file. You specify this to be UTF-8 via the u8 prefix.
Most likely, the second factor is where this goes wrong. If the compiler interprets the source file as ISO-8859, then it will read the three characters, convert them to UTF-8, and write those out, giving you a 6-byte string as a result (each of those characters encodes to 2 bytes in UTF-8).
However, if it assumes the source file to be UTF-8, then it won't need to do a conversion at all: it reads 3 bytes, which it assumes are UTF-8 (even though they're invalid garbage values for UTF-8), and since you asked for the output string to be UTF-8 as well, it just outputs those same 3 bytes.
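If you want to see which of the two cases you're hitting, you can dump the raw bytes of the literal. Here is a minimal sketch (assuming the literal in your source file is "åäö"; compile as C++17 or earlier, since in C++20 the type of a u8 literal changes to const char8_t[]):

```cpp
#include <cstdio>

int main() {
    // If the compiler converted from ISO-8859 to UTF-8, this prints 6 bytes:
    // c3 a5 c3 a4 c3 b6. If it passed the source bytes through untouched,
    // you get the original 3 bytes instead.
    const char s[] = u8"åäö";
    for (const char* p = s; *p != '\0'; ++p)
        std::printf("%02x ", static_cast<unsigned char>(*p));
    std::printf("\n");
    return 0;
}
```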
You can tell GCC which source encoding to assume with -finput-charset, or you can encode the source as UTF-8, or you can use \uXXXX escape sequences in the string literal (\u00E5 instead of å, for example).
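The escape-sequence route has the advantage of being independent of the source file's encoding, since the code points are spelled out directly. A small sketch (for C++17 and earlier; in C++20 the literal's type becomes const char8_t*):

```cpp
// \u00E5, \u00E4 and \u00F6 are the code points of å, ä and ö, so this
// literal yields the same UTF-8 bytes no matter how the file is encoded.
const char* s = u8"\u00E5\u00E4\u00F6";
```

Alternatively, compiling the original file with something like g++ -finput-charset=ISO-8859-1 main.cpp tells GCC how to interpret the bytes it reads (the file name and the exact charset name here are just examples).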
Edit:
To clarify a bit, when you specify a string literal with the u8 prefix in your source code, you are telling the compiler: "regardless of which encoding you used when reading the source text, please convert it to UTF-8 when writing it out to the object file". You are saying nothing about how the source text should be interpreted. That is up to the compiler to decide (perhaps based on which flags you passed to it, perhaps based on the process' environment, or perhaps just using a hardcoded default).
If the string in your source text contains the bytes 0xe5, 0xe4, 0xf6, and you tell the compiler that the source text is encoded as ISO-8859, then it will recognize that the string consists of the characters "åäö". It will see the u8 prefix, and convert these characters to UTF-8, writing the byte sequence 0xc3, 0xa5, 0xc3, 0xa4, 0xc3, 0xb6 to the object file. In this case, you end up with a valid UTF-8 encoded text string containing the UTF-8 representation of the characters "åäö".
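A quick way to convince yourself of that byte count is a compile-time size check. This sketch uses \u escapes so it doesn't depend on the file's encoding; if you instead write the literal characters and the conversion doesn't happen, the assertion fires because the literal is only 4 bytes long:

```cpp
// Three 2-byte UTF-8 sequences (0xc3 0xa5, 0xc3 0xa4, 0xc3 0xb6) plus the
// terminating NUL give 7 bytes in total.
static_assert(sizeof(u8"\u00E5\u00E4\u00F6") == 7,
              "expected three 2-byte UTF-8 sequences plus a NUL");
```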
However, if the string in your source text contains the same bytes, and you make the compiler believe that the source text is encoded as UTF-8, then there are two things the compiler may do (depending on the implementation):
- it might try to parse the bytes as UTF-8, in which case it will recognize that "this is not a valid UTF-8 sequence", and issue an error. This is what Clang does.
- alternatively, it might say "ok, I have 3 bytes here, and I am told to assume that they form a valid UTF-8 string. I'll hold on to them and see what happens". Then, when it is supposed to write the string to the object file, it goes "ok, I have these 3 bytes from before, which are marked as being UTF-8. The u8 prefix here means that I am supposed to write this string as UTF-8. Cool, no need to do a conversion then. I'll just write these 3 bytes and I'm done". This is what GCC does.
Both are valid. The C++ language doesn't require the compiler to check the validity of the string literals you pass to it.
But in both cases, note that the u8 prefix has nothing to do with your problem. It just tells the compiler to convert the string from whatever encoding it had when read, to UTF-8. But even before this conversion, the string was already garbled, because the bytes corresponded to ISO-8859 character data while the compiler believed them to be UTF-8 (because you didn't tell it otherwise).
The problem you are seeing is simply that the compiler didn't know which encoding to use when reading the string literal from your source file.
The other thing you are noticing is that a "traditional" string literal, with no prefix, is going to be encoded with whatever encoding the compiler likes. The u8 prefix (and the corresponding UTF-16 and UTF-32 prefixes) was introduced precisely to let you specify which encoding you want the compiler to use for the output. Plain prefix-less literals do not specify an encoding at all, leaving it up to the compiler to decide on one.
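For completeness, here is a short sketch contrasting the prefixes (C++17): each prefixed literal pins down the encoding of the data in the object file, while the plain literal uses whatever execution character set the compiler chose.

```cpp
const char     plain[] =  "\u00E5";  // execution charset, implementation-defined bytes
const char     utf8[]  = u8"\u00E5"; // UTF-8: 0xc3 0xa5, sizeof == 3 (incl. NUL)
const char16_t utf16[] =  u"\u00E5"; // UTF-16: one char16_t unit 0x00e5 + NUL
const char32_t utf32[] =  U"\u00E5"; // UTF-32: one char32_t unit 0x000000e5 + NUL
```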