java - UTF Encoding for Chinese CharactersJava

Question

Welcome To Ask or Share your Answers For Others

java - UTF Encoding for Chinese CharactersJava

asked Oct 24, 2021 in Technique[技术] by 深蓝 (71.8m points)

I am receiving a String via an object from an axis webservice. Because I'm not getting the string I expected, I did a check by converting the string into bytes and I get C3A4C2 BDC2A0 C3A5C2 A5C2BD C3A5C2 90C297 in hexa, when I'm expecting E4BDA0 E5A5BD E59097 which is actually 你好吗 in UTF-8.

Any ideas what might be causing 你好吗 to become C3A4C2 BDC2A0 C3A5C2 A5C2BD C3A5C2 90C297? I did a Google search but all I got was a chinese website describing a problem that happens in python. Any insights will be great, thanks!

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙；凝视深渊过久,深渊将回以凝视…

205 views

1 Answer

深蓝 · Answer 1 · 2021-10-23T19:30:36+0000

You have what is known as a double encoding.

You have the three character sequence "你好吗" which you correctly point out is encoded in UTF-8 as E4BDA0 E5A5BD E59097.

But now, start encoding each byte of THAT encoding in UTF-8. Start with E4. What is that codepoint in UTF-8? Try it! It's C3 A4!

You get the idea.... :-)

Here is a Java app which illustrates this:

public class DoubleEncoding {
    public static void main(String[] args) throws Exception {
        byte[] encoding1 = "你好吗".getBytes("UTF-8");
        String string1 = new String(encoding1, "ISO8859-1");
        for (byte b : encoding1) {
            System.out.printf("%2x ", b);
        }
        System.out.println();
        byte[] encoding2 = string1.getBytes("UTF-8");
        for (byte b : encoding2) {
            System.out.printf("%2x ", b);
        }
        System.out.println();
    }
}

Categories

java - UTF Encoding for Chinese CharactersJava

Please log in or register to add a comment.

Please log in or register to answer this question.

1 Answer

Please log in or register to add a comment.

Just Browsing Browsing

Most popular tags