Disclaimer: I'm the author of the char8_t P0482 and P1423 proposals.
In C++20, char8_t is a distinct type from all other types. In the related proposal for C, N2653, char8_t is a typedef of unsigned char, similar to the existing typedefs for char16_t and char32_t.
In C++20, char8_t has an underlying representation that matches unsigned char. It therefore has the same size (at least 8-bit, but may be larger), alignment, and integer conversion rank as unsigned char, but has different aliasing rules.
In particular, char8_t was not added to the list of types at [basic.lval]p11, [basic.life]p6.4, [basic.types]p2, or [basic.types]p4. This means that, unlike unsigned char, it cannot be used for the underlying storage of objects of another type, nor can it be used to examine the underlying representation of objects of other types; in other words, it cannot be used to alias other types. A consequence of this is that objects of type char8_t can be accessed via pointers to char or unsigned char, but pointers to char8_t cannot be used to access char or unsigned char data. In other words:
reinterpret_cast<const char *>(u8"text"); // Ok.
reinterpret_cast<const char8_t*>("text"); // Undefined behavior.
The motivation for a distinct type with these properties is:
To provide a distinct type for UTF-8 character data vs character data with an encoding that is either locale dependent or that requires separate specification.
To enable overloading for ordinary string literals vs UTF-8 string literals (since they may have different encodings).
To ensure an unsigned type for UTF-8 data (whether char is signed or unsigned is implementation-defined).
To enable better performance via a non-aliasing type; optimizers can better optimize types that do not alias other types.