Welcome to ShenZhenJia Knowledge Sharing Community for programmer and developer-Open, Learning and Share

I am reading some text files in a Java program and would like to replace some Unicode characters with ASCII approximations. These files will eventually be broken into sentences that are fed to OpenNLP. OpenNLP does not recognize Unicode characters and gives improper results on a number of symbols (it tokenizes "girl's" as "girl" and "'s", but if the apostrophe is a Unicode quotation mark, the whole word is treated as a single token).

For example, the source sentence may contain the Unicode directional quotation mark U+2018 (‘), and I would like to convert that to U+0027 ('). Eventually I will be stripping the remaining non-ASCII characters.
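To make the two steps in this example concrete, here is a minimal sketch, assuming the text is already loaded into a String. The class name AsciiFallback and both method names are hypothetical and only serve the illustration. It converts U+2018 to U+0027 first and then drops whatever non-ASCII characters are left.

```java
final class AsciiFallback {
    // Replace the left single quotation mark (U+2018) with the ASCII apostrophe (U+0027).
    static String singleQuoteToAscii(String text) {
        return text.replace('\u2018', '\'');
    }

    // Remove every character outside the ASCII range; this is lossy by design.
    static String stripNonAscii(String text) {
        return text.replaceAll("[^\\p{ASCII}]", "");
    }

    public static void main(String[] args) {
        String source = "\u2018quoted\u2019 caf\u00e9";
        String converted = singleQuoteToAscii(source);
        System.out.println(stripNonAscii(converted));
    }
}
```

The order matters: stripping first would delete the quotation marks before they could be mapped to an ASCII equivalent.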

I understand that I am losing information, and I know that I could write regular expressions to convert each of these symbols, but I am asking if there is code I can reuse to convert some of these symbols.

This is what I could come up with, but I'm sure I will make mistakes, miss things, etc.:

    // double quotation (")
    replacements.add(new Replacement(Pattern.compile("[\u201c\u201d\u201e\u201f\u275d\u275e]"), "\""));

    // single quotation (')
    replacements.add(new Replacement(Pattern.compile("[\u2018\u2019\u201a\u201b\u275b\u275c]"), "'"));

replacements is a list of instances of a custom Replacement class, which I later iterate over to apply each replacement:

    for (Replacement replacement : replacements) {
        text = replacement.pattern.matcher(text).replaceAll(replacement.replacement);
    }
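For reference, the approach described above can be put together as a self-contained sketch. The question does not show the Replacement class, so its shape here (two final fields, pattern and replacement) is an assumption, as are the names QuoteReplacer and toAscii.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

final class QuoteReplacer {
    // Assumed shape of the custom class: a compiled pattern plus its ASCII substitute.
    static final class Replacement {
        final Pattern pattern;
        final String replacement;

        Replacement(Pattern pattern, String replacement) {
            this.pattern = pattern;
            this.replacement = replacement;
        }
    }

    private static final List<Replacement> REPLACEMENTS = new ArrayList<>();

    static {
        // Double quotation marks map to the ASCII double quote.
        REPLACEMENTS.add(new Replacement(
                Pattern.compile("[\u201c\u201d\u201e\u201f\u275d\u275e]"), "\""));
        // Single quotation marks map to the ASCII apostrophe.
        REPLACEMENTS.add(new Replacement(
                Pattern.compile("[\u2018\u2019\u201a\u201b\u275b\u275c]"), "'"));
    }

    // Apply every registered replacement in order and return the result.
    static String toAscii(String text) {
        for (Replacement r : REPLACEMENTS) {
            text = r.pattern.matcher(text).replaceAll(r.replacement);
        }
        return text;
    }

    public static void main(String[] args) {
        // prints: The girl's book said "hello".
        System.out.println(toAscii("The girl\u2019s book said \u201chello\u201d."));
    }
}
```

If a replacement string ever contains `$` or a backslash, wrapping it in Matcher.quoteReplacement avoids it being read as a group reference.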

As you can see, I had to find:

  • LEFT SINGLE QUOTATION MARK
  • RIGHT SINGLE QUOTATION MARK
  • SINGLE LOW-9 QUOTATION MARK (what is this/should I replace this?)
  • SINGLE HIGH-REVERSED-9 QUOTATION MARK (what is this/should I replace this?)
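One way to double-check a list like this is to ask the JDK for the official Unicode name of each code point in the character class. The sketch below uses Character.getName (available since Java 7); the class name QuoteNames and the describe helper are hypothetical. For context, U+201A is the low opening quote used in some languages such as German, and U+201B is a reversed variant of the left single quote, so mapping both to an apostrophe seems a reasonable approximation for tokenizing.

```java
final class QuoteNames {
    // The single-quote code points from the character class in the question.
    static final int[] SINGLE_QUOTES = {0x2018, 0x2019, 0x201A, 0x201B, 0x275B, 0x275C};

    // Format a code point as "U+XXXX NAME" using the JDK's Unicode data.
    static String describe(int codePoint) {
        return String.format("U+%04X %s", codePoint, Character.getName(codePoint));
    }

    public static void main(String[] args) {
        for (int codePoint : SINGLE_QUOTES) {
            System.out.println(describe(codePoint));
        }
    }
}
```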