In UTF-8 each code point (= logical character) is represented by one or more code units (= char); ɑftənun, in particular, is:
ch| c.p. | c.u.
--+------+-------
ɑ | 0251 | c9 91
f | 0066 | 66
t | 0074 | 74
ə | 0259 | c9 99
n | 006e | 6e
u | 0075 | 75
n | 006e | 6e
(ch = character; c.p. = code point number; c.u. = code unit representation in UTF-8; both c.p. and c.u. are expressed in hexadecimal)
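To make the table concrete, here is a minimal sketch (a standalone program, assuming the source file itself is saved as UTF-8) that dumps the code units of the same string in hexadecimal:

#include <cstdio>
#include <string>

int main() {
    std::string word_ipa = "ɑftənun";   // 7 code points, 9 code units
    for (unsigned char c : word_ipa)
        std::printf("%02x ", c);        // prints: c9 91 66 74 c9 99 6e 75 6e
    std::printf("\n");
}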
The exact details of how code points are mapped to code units are explained in many places (the Wikipedia article on UTF-8 is a good reference); the very basics are that:
- code points up to 0x7f are mapped straight to a single code unit; for these, the high bit is never set;
- code points from 0x80 onwards are mapped to multiple code units; all the code units in a multi-code-unit sequence have the high bit set;
- if the high bit is set, the top bits have a particular meaning: in the first byte of a multi-byte sequence they tell how many continuation bytes are to be expected, while in the others they unambiguously mark the byte as a continuation byte (see the sketch right below).
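Just as an illustration of these rules (not meant as production code), a single code unit can be classified by looking at its top bits:

#include <cstdint>

enum class ByteKind { Ascii, Lead2, Lead3, Lead4, Continuation, Invalid };

ByteKind classify(uint8_t b) {
    if (b >> 7 == 0)  return ByteKind::Ascii;        // 0xxxxxxx: a whole code point by itself
    if (b >> 6 == 2)  return ByteKind::Continuation; // 10xxxxxx: continuation unit
    if (b >> 5 == 6)  return ByteKind::Lead2;        // 110xxxxx: starts a 2-unit sequence
    if (b >> 4 == 14) return ByteKind::Lead3;        // 1110xxxx: starts a 3-unit sequence
    if (b >> 3 == 30) return ByteKind::Lead4;        // 11110xxx: starts a 4-unit sequence
    return ByteKind::Invalid;                        // 11111xxx never appears in valid UTF-8
}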
If you print out each code unit on its own, you break the UTF-8 encoding for the code points that require more than one code unit to be expressed. For the first row, your terminal application sees
c9 0a
(the first code unit followed by a newline), and immediately detects that this is a broken UTF-8 sequence, as c9 has the high bit set but the next c.u. doesn't; hence the ? character. The same holds for ɑ's second code unit (a stray continuation byte on its own), as well as for the code units of the sequence representing ə.
Now, if you want to print out full code points (not code units), std::string won't be of any help: std::string knows nothing about this stuff; it is essentially a glorified std::vector<char>, completely oblivious to encoding issues; all it does is store/index code units, not code points.
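You can see this for yourself quite easily (again assuming a UTF-8 source file): size() and operator[] deal in code units, so the 7-character string from the table reports a length of 9:

#include <iostream>
#include <string>

int main() {
    std::string word_ipa = "ɑftənun";
    std::cout << word_ipa.size() << "\n";             // 9 (code units), not 7 (code points)
    std::cout << word_ipa[0] << word_ipa[1] << "\n";  // the two code units of ɑ, back to back
}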
There are however third party libraries to help work with this; utf8-cpp is a small but complete one; in your case, the utf8::next
function would be particularly helpful:
while (source >> word >> word_ipa) {
    auto cur = word_ipa.begin();
    auto end = word_ipa.end();
    auto next = cur;
    for (; cur != end; cur = next) {
        utf8::next(next, end);                       // advance next past one full code point
        myfile << word << " is ";
        for (; cur != next; ++cur) myfile << *cur;   // emit every code unit of that code point
        myfile << "\n";
    }
}
utf8::next here just increments the given iterator so that it points to the code unit that starts the next code point; this code makes sure that we print together all the code units that make up a single code point.
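For reference, a self-contained sketch along the same lines (assuming the utf8-cpp header utf8.h is on the include path); note that utf8::next also returns the decoded code point, which the loop above simply ignores:

#include <cstdint>
#include <iostream>
#include <string>
#include "utf8.h"

int main() {
    std::string word_ipa = "ɑftənun";
    auto it = word_ipa.begin();
    while (it != word_ipa.end()) {
        uint32_t cp = utf8::next(it, word_ipa.end());  // advance past one code point, get its value
        std::cout << std::hex << cp << "\n";           // 251, 66, 74, 259, 6e, 75, 6e
    }
}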
Notice that we can reproduce its barebones behavior quite simply; it's just a matter of reading the UTF-8 specs (see the first table in the Wikipedia article mentioned above):
#include <cstddef>
#include <cstdint>
#include <iterator>
#include <stdexcept>

template<typename ItT>
void safe_advance(ItT &it, size_t n, ItT end) {
    size_t d = std::distance(it, end);
    if (n > d) throw std::logic_error("Truncated UTF-8 sequence");
    std::advance(it, n);
}

template<typename ItT>
void my_next(ItT &it, ItT end) {
    uint8_t b = *it;
    if      (b >> 7 == 0)  safe_advance(it, 1, end);   // 0xxxxxxx: single-unit code point
    else if (b >> 5 == 6)  safe_advance(it, 2, end);   // 110xxxxx: 2-unit sequence
    else if (b >> 4 == 14) safe_advance(it, 3, end);   // 1110xxxx: 3-unit sequence
    else if (b >> 3 == 30) safe_advance(it, 4, end);   // 11110xxx: 4-unit sequence
    else throw std::logic_error("Invalid UTF-8 sequence");
}
Here we are exploiting the fact that the first byte of a sequence declares how many extra code units are going to follow to complete the code point.
(Notice that this expects valid UTF-8 and makes no attempt to resynchronize a broken UTF-8 sequence; the library version probably fares way better in this regard.)
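If you want to try it out, it's just a matter of swapping the call in the earlier loop, something like:

for (; cur != end; cur = next) {
    my_next(next, end);                          // instead of utf8::next(next, end)
    myfile << word << " is ";
    for (; cur != next; ++cur) myfile << *cur;
    myfile << "\n";
}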
OTOH, it's also possible to inline just what's necessary to keep the code units of the same code point together:
while (source >> word >> word_ipa) {
    auto cur = word_ipa.begin();
    auto end = word_ipa.end();
    for (; cur != end;) {
        myfile << word << " is " << *cur;
        if (uint8_t(*cur++) >> 7 != 0) {
            // lead byte of a multi-unit sequence: keep printing while the
            // top two bits of the following units are 10 (continuation units)
            for (; cur != end && (uint8_t(*cur) >> 6) == 2; ++cur) myfile << *cur;
        }
        myfile << "\n";
    }
}
Here instead we completely disregard the "declared count" in the first c.u.; we just check whether the high bit is set, and in that case we go on printing as long as we get c.u. with the top two bits set to 10 (in binary, AKA 2 in decimal), since the "continuation c.u." of a multi-c.u. UTF-8 sequence all follow this pattern.