c++ - Code to strip diacritical marks using ICU

Question

Welcome To Ask or Share your Answers For Others

c++ - Code to strip diacritical marks using ICU

asked Oct 24, 2021 in Technique[技术] by 深蓝 (71.8m points)

Can somebody please provide some sample code to strip diacritical marks (i.e., replace characters having accents, umlauts, etc., with their unaccented, unumlauted, etc., character equivalents, e.g., every accented é would become a plain ASCII e) from a UnicodeString using the ICU library in C++? E.g.:

UnicodeString strip_diacritics( UnicodeString const &s ) {
    UnicodeString result;
    // ...
    return result;
}

Assume that s has already been normalized. Thanks.

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙；凝视深渊过久,深渊将回以凝视…

391 views

1 Answer

深蓝 · Answer 1 · 2021-10-23T17:56:03+0000

ICU lets you transliterate a string using a specific rule. My rule is NFD; [:M:] Remove; NFC: decompose, remove diacritics, recompose. The following code takes an UTF-8 std::string as an input and returns another UTF-8 std::string:

#include <unicode/utypes.h>
#include <unicode/unistr.h>
#include <unicode/translit.h>

std::string desaxUTF8(const std::string& str) {
    // UTF-8 std::string -> UTF-16 UnicodeString
    UnicodeString source = UnicodeString::fromUTF8(StringPiece(str));

    // Transliterate UTF-16 UnicodeString
    UErrorCode status = U_ZERO_ERROR;
    Transliterator *accentsConverter = Transliterator::createInstance(
        "NFD; [:M:] Remove; NFC", UTRANS_FORWARD, status);
    accentsConverter->transliterate(source);
    // TODO: handle errors with status

    // UTF-16 UnicodeString -> UTF-8 std::string
    std::string result;
    source.toUTF8String(result);

    return result;
}

Categories

c++ - Code to strip diacritical marks using ICU

Please log in or register to add a comment.

Please log in or register to answer this question.

1 Answer

Please log in or register to add a comment.

Just Browsing Browsing

Most popular tags