Welcome to ShenZhenJia Knowledge Sharing Community for programmer and developer-Open, Learning and Share
menu search
person
Welcome To Ask or Share your Answers For Others

Categories

In the CPP reference documentation,

I noticed for char

The character types are large enough to represent any UTF-8 eight-bit code unit (since C++14)

and for char8_t

type for UTF-8 character representation, required to be large enough to represent any UTF-8 code unit (8 bits)

Does that mean both are the same type? Or does char8_t have some other feature?

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
thumb_up_alt 0 like thumb_down_alt 0 dislike
384 views
Welcome To Ask or Share your Answers For Others

1 Answer

Disclaimer: I'm the author of the char8_t P0482 and P1423 proposals.

In C++20, char8_t is a distinct type from all other types. In the related proposal for C, N2653, char8_t is a typedef of unsigned char similar to the existing typedefs for char16_t and char32_t.

In C++20, char8_t has an underlying representation that matches unsigned char. It therefore has the same size (at least 8-bit, but may be larger), alignment, and integer conversion rank as unsigned char, but has different aliasing rules.

In particular, char8_t was not added to the list of types at [basic.lval]p11. [basic.life]p6.4, [basic.types]p2, or [basic.types]p4. This means that, unlike unsigned char, it cannot be used for the underlying storage of objects of another type, nor can it be used to examine the underlying representation of objects of other types; in other words, it cannot be used to alias other types. A consequence of this is that objects of type char8_t can be accessed via pointers to char or unsigned char, but pointers to char8_t cannot be used to access char or unsigned char data. In other words:

reinterpret_cast<const char   *>(u8"text"); // Ok.
reinterpret_cast<const char8_t*>("text");   // Undefined behavior.

The motivation for a distinct type with these properties is:

  1. To provide a distinct type for UTF-8 character data vs character data with an encoding that is either locale dependent or that requires separate specification.

  2. To enable overloading for ordinary string literals vs UTF-8 string literals (since they may have different encodings).

  3. To ensure an unsigned type for UTF-8 data (whether char is signed or unsigned is implementation defined).

  4. To enable better performance via a non-aliasing type; optimizers can better optimize types that do not alias other types.


与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
thumb_up_alt 0 like thumb_down_alt 0 dislike
Welcome to ShenZhenJia Knowledge Sharing Community for programmer and developer-Open, Learning and Share
...