Welcome to ShenZhenJia Knowledge Sharing Community for programmer and developer-Open, Learning and Share

I have a UTF-8 encoded string. I can read it from a file and write it to another file just fine, but when I try to process its characters one by one, the result isn't coherent. I'm most likely going about this the wrong way; what is the correct way to do it?

The content in source.txt is

afternoon_gb_1          ɑftənun

The code I wrote is

while (source >> word >> word_ipa) {
    for (char& c : word_ipa)
        myfile << word << " is " << c << endl;
}

The content in the txt file myfile gets written as

afternoon_gb_1 is ?
afternoon_gb_1 is ?
afternoon_gb_1 is f
afternoon_gb_1 is t
afternoon_gb_1 is ?
afternoon_gb_1 is ?
afternoon_gb_1 is n
afternoon_gb_1 is u
afternoon_gb_1 is n

1 Answer

In UTF-8 each code point (= logical character) is represented by one or more code units (= char); ɑftənun, in particular, is:

ch| c.p. | c.u.
--+------+-------
ɑ | 0251 | c9 91
f | 0066 | 66
t | 0074 | 74
ə | 0259 | c9 99
n | 006e | 6e
u | 0075 | 75
n | 006e | 6e

(ch: character; c.p.: code point number; c.u.: code unit representation in UTF-8; c.p. and c.u. are expressed in hexadecimal)

The exact details of how the code points are mapped to the code units is explained in many places; the very basics are that:

  • code points up to 0x7f are mapped straight to a single code unit; for these, the high bit is never set;
  • code points from 0x80 onwards are mapped to multiple code units; all the code units in a multi-code-unit sequence have the high bit set;
  • if the high bit is set, the top bits have a particular meaning; in the first byte of a multibyte sequence they tell how many continuation bytes are to be expected, in the others they are unambiguously marked as continuation bytes.
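The three rules above can be sketched as a small classifier. This is only an illustration: the names UnitKind, classify, continuation_count and check_classify are made up for this example, and the code looks at bit patterns only (it does not reject lead bytes such as c0 or c1 that valid UTF-8 never uses).

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical names, for illustration only.
enum class UnitKind { Single, Lead, Continuation, Invalid };

// Classify one UTF-8 code unit by looking at its top bits.
inline UnitKind classify(std::uint8_t b) {
    if ((b >> 7) == 0) return UnitKind::Single;        // 0xxxxxxx
    if ((b >> 6) == 2) return UnitKind::Continuation;  // 10xxxxxx
    if ((b >> 5) == 6 || (b >> 4) == 14 || (b >> 3) == 30)
        return UnitKind::Lead;                         // 110xxxxx, 1110xxxx, 11110xxx
    return UnitKind::Invalid;                          // 11111xxx is not used by UTF-8
}

// Number of continuation units announced by a lead byte (0 otherwise).
inline int continuation_count(std::uint8_t b) {
    if ((b >> 5) == 6)  return 1;
    if ((b >> 4) == 14) return 2;
    if ((b >> 3) == 30) return 3;
    return 0;
}

// Self-check against the table: the first character is encoded as c9 91.
inline void check_classify() {
    assert(classify(0xc9) == UnitKind::Lead);
    assert(continuation_count(0xc9) == 1);
    assert(classify(0x91) == UnitKind::Continuation);
    assert(classify(0x66) == UnitKind::Single);  // 'f'
}
```

With this, the broken pair c9 0a discussed below is easy to spot: c9 announces one continuation unit, but 0a classifies as Single.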

If you print out each code unit on its own you are breaking the UTF-8 encoding for the code points that require more than one code unit to be expressed. Your terminal application in the first row sees

c9 0a

(the first code unit followed by a newline), and immediately detects that this is a broken UTF-8 sequence, as c9 has the high bit set but the next c.u. doesn't have it; hence the ? character. The same holds for the second character, as well as for the c.u. parts of the sequence representing ə.


Now, if you want to print out full code-points (not code-units), std::string won't be of any help - std::string knows nothing about this stuff, it is essentially a glorified std::vector<char>, completely oblivious of encoding issues; all it does is to store/index code units, not code points.
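To make the "code units, not code points" remark concrete, here is a minimal sketch. The names count_code_points, ipa and check_counts are hypothetical, and the helper assumes the input is valid UTF-8. The string is spelled out with hex escapes so the example does not depend on the encoding of the source file.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: counts code points by counting every code unit
// that is NOT a continuation byte (10xxxxxx). Assumes valid UTF-8.
inline std::size_t count_code_points(const std::string& s) {
    std::size_t n = 0;
    for (unsigned char c : s)
        if ((c >> 6) != 2) ++n;
    return n;
}

// The IPA string from the question, written as raw UTF-8 bytes
// (c9 91, 'f', 't', c9 99, 'n', 'u', 'n'). The literal is split so that
// letters following a hex escape are not swallowed by the escape.
const std::string ipa = "\xc9\x91" "ft" "\xc9\x99" "nun";

inline void check_counts() {
    assert(ipa.size() == 9);               // std::string counts code units
    assert(count_code_points(ipa) == 7);   // the reader sees 7 characters
}
```

The mismatch between 9 and 7 is exactly why the range-based for loop in the question produces two extra, broken lines.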

There are however third party libraries to help work with this; utf8-cpp is a small but complete one; in your case, the utf8::next function would be particularly helpful:

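// requires #include "utf8.h" (from utf8-cpp)
while (source >> word >> word_ipa) {
    auto cur = word_ipa.begin();
    auto end = word_ipa.end();
    auto next = cur;
    for(;cur!=end; cur=next) {
        // move next to the first code unit of the following code point
        utf8::next(next, end);
        myfile << word << " is ";
        // print all the code units of the current code point together
        for(; cur!=next; ++cur) myfile<<*cur;
        myfile << "\n";
    }
}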

utf8::next here just increments the given iterator to make it point to the code unit that starts the next code point; this code makes sure that we print together all the code units that make up a single code point.

Notice that we can reproduce its barebones behavior quite simply; it's just a matter of reading the UTF-8 specs (see the first table in the Wikipedia article on UTF-8):

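#include <cstddef>
#include <cstdint>
#include <iterator>
#include <stdexcept>

// Advance it by n code units, refusing to run past end.
template<typename ItT>
void safe_advance(ItT &it, std::size_t n, ItT end) {
    std::size_t d = std::distance(it, end);
    if(n>d) throw std::logic_error("Truncated UTF-8 sequence");
    std::advance(it, n);
}


// Move it to the start of the next code point; it must not be equal to end.
template<typename ItT>
void my_next(ItT &it, ItT end) {
    std::uint8_t b = *it;
    if(b>>7 == 0) safe_advance(it, 1, end);        // 0xxxxxxx: 1 code unit
    else if(b>>5 == 6) safe_advance(it, 2, end);   // 110xxxxx: 2 code units
    else if(b>>4 == 14) safe_advance(it, 3, end);  // 1110xxxx: 3 code units
    else if(b>>3 == 30) safe_advance(it, 4, end);  // 11110xxx: 4 code units
    else throw std::logic_error("Invalid UTF-8 sequence");
}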

Here we are exploiting the fact that the first byte of a sequence declares how many extra code units are going to come to complete the code point.

(notice that this expects valid UTF-8 and makes no attempt to resynchronize a broken UTF-8 sequence; the library version probably fares way better in this regard)
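The same "declared count" can also be used to recover the code point number itself, which gives a way to check the c.p. column of the table. What follows is a rough sketch rather than a full decoder: decode_at, first_code_point and check_decode are hypothetical names, and, like my_next, the code does not reject overlong forms or surrogates and does not try to resynchronize.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>

// Hypothetical helper: decodes the code point starting at s[pos] and moves
// pos past it. Only truncation and the 10xxxxxx pattern are checked.
inline std::uint32_t decode_at(const std::string& s, std::size_t& pos) {
    if (pos >= s.size()) throw std::out_of_range("decode_at: pos past the end");
    const std::uint8_t b = static_cast<std::uint8_t>(s[pos]);
    std::size_t extra = 0;   // continuation units announced by the lead byte
    std::uint32_t cp = 0;    // payload bits collected so far
    if ((b >> 7) == 0)       { extra = 0; cp = b; }
    else if ((b >> 5) == 6)  { extra = 1; cp = b & 0x1f; }
    else if ((b >> 4) == 14) { extra = 2; cp = b & 0x0f; }
    else if ((b >> 3) == 30) { extra = 3; cp = b & 0x07; }
    else throw std::logic_error("Invalid UTF-8 sequence");
    if (s.size() - pos - 1 < extra)
        throw std::logic_error("Truncated UTF-8 sequence");
    for (std::size_t i = 1; i <= extra; ++i) {
        const std::uint8_t c = static_cast<std::uint8_t>(s[pos + i]);
        if ((c >> 6) != 2) throw std::logic_error("Invalid UTF-8 sequence");
        cp = (cp << 6) | (c & 0x3f);  // each continuation unit carries 6 bits
    }
    pos += extra + 1;
    return cp;
}

// Convenience wrapper: code point of the first character of s.
inline std::uint32_t first_code_point(const std::string& s) {
    std::size_t pos = 0;
    return decode_at(s, pos);
}

inline void check_decode() {
    assert(first_code_point("\xc9\x91") == 0x0251);  // first row of the table
    assert(first_code_point("\xc9\x99") == 0x0259);  // fourth row of the table
    assert(first_code_point("f") == 0x0066);
}
```

For c9 91 the lead byte contributes the bits 01001 and the continuation byte the bits 010001, which together give 0x251.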

OTOH, it's also possible to inline just what's necessary to keep the code units of the same code point together:

while (source >> word >> word_ipa) {
    auto cur = word_ipa.begin();
    auto end = word_ipa.end();
    for(;cur!=end;) {
        // print the first code unit of the code point
        myfile << word << " is "<<*cur;
        if(uint8_t(*cur++)>>7 != 0) {
            // high bit set: print the continuation code units that follow
            for(; cur!=end && (uint8_t(*cur)>>6)==2; ++cur) myfile<<*cur;
        }
        myfile << "\n";
    }
}

Here instead we are disregarding completely the "declared count" in the first c.u.; we just check if the high bit is set. In that case, we go on printing as long as we get c.u. with the top two bits set to 10 (in binary, i.e. 2 in decimal), since the "continuation c.u." of a multi-c.u. UTF-8 sequence all follow this pattern.
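The same continuation-byte test can be wrapped in a reusable function that returns the pieces instead of printing them. This is a sketch under the same assumptions as the loop above (input is expected to be valid UTF-8; nothing is validated); split_code_points and check_split are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: splits a UTF-8 string into one std::string per code
// point, using only the high-bit and 10xxxxxx checks from the loop above.
inline std::vector<std::string> split_code_points(const std::string& s) {
    std::vector<std::string> out;
    for (std::size_t i = 0; i < s.size();) {
        std::size_t j = i + 1;
        // Only a unit with the high bit set can be followed by continuation units.
        if ((static_cast<unsigned char>(s[i]) >> 7) != 0) {
            while (j < s.size() && (static_cast<unsigned char>(s[j]) >> 6) == 2)
                ++j;
        }
        out.push_back(s.substr(i, j - i));  // all code units of one code point
        i = j;
    }
    return out;
}

inline void check_split() {
    // The IPA string from the question as raw bytes: c9 91 f t c9 99 n u n.
    const std::vector<std::string> parts =
        split_code_points("\xc9\x91" "ft" "\xc9\x99" "nun");
    assert(parts.size() == 7);
    assert(parts[0] == "\xc9\x91");
    assert(parts[1] == "f");
    assert(parts[3] == "\xc9\x99");
}
```

Writing each element of the result on its own line then yields seven intact lines for the example input, instead of the nine lines with broken characters shown in the question.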

