Java - SAX parsing and encoding

Question



asked Oct 24, 2021 in Technique by 深蓝 (71.8m points)

I have a contact who is having trouble with SAX when parsing RSS and Atom files. According to him, text coming from the Item elements appears to be truncated at an apostrophe or sometimes at an accented character. There also seems to be a problem with encoding.

I've given SAX a try and I see some truncation too, but I haven't been able to dig further. I'd appreciate suggestions from anyone who has tackled this before.

This is the code that's being used in the ContentHandler:

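public void characters(char[] ch, int start, int length) throws SAXException {
    // The third argument is a length, not an end index.
    // Assigning here keeps only the chunk from the most recent call.
    link = new String(ch, start, length);
}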

Edit: The encoding problem might be due to storing the information in a byte array, since Java works in Unicode internally.


1 Answer

深蓝 · Answer 1 · 2021-10-23T17:56:03+0000

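The characters() method is not guaranteed to deliver the complete character content of a text element in a single call. The parser may split the text into several chunks, for example at internal buffer boundaries, and many parsers also split at entity references, which would explain the apparent truncation at apostrophes. You need to buffer the characters yourself between the start and end element events.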

e.g.

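// Accumulates the text of the element currently being read.
StringBuilder builder;

public void startElement(String uri, String localName, String qName, Attributes atts) {
   // Start a fresh buffer for each element.
   builder = new StringBuilder();
}

public void characters(char[] ch, int start, int length) {
   // May be called several times per element, so append rather than assign.
   builder.append(ch, start, length);
}

public void endElement(String uri, String localName, String qName) {
   // Only now is the element's text complete.
   String theFullText = builder.toString();
}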
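
For reference, the buffering approach above can be put into a small, self-contained handler. This is only a sketch under some assumptions: the class name TitleCollector, the parseTitles helper and the sample feed are made up for illustration, and it collects only title elements. It also hands the raw bytes to the parser as an InputStream, so the parser picks the encoding from the XML declaration instead of the caller decoding the bytes by hand. That may help with the encoding issue if it comes from manual byte-to-String conversion, which is a guess based on the question's edit.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical example class: collects the full text of every <title> element.
public class TitleCollector extends DefaultHandler {
    private final List<String> titles = new ArrayList<>();
    private StringBuilder builder; // non-null only while inside a <title>

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if ("title".equals(qName)) {
            builder = new StringBuilder();
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // May be called several times for one element, so append each chunk.
        if (builder != null) {
            builder.append(ch, start, length);
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if ("title".equals(qName) && builder != null) {
            titles.add(builder.toString());
            builder = null;
        }
    }

    // Parses raw bytes; the parser reads the encoding from the XML declaration.
    public static List<String> parseTitles(byte[] xml) throws Exception {
        TitleCollector handler = new TitleCollector();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new ByteArrayInputStream(xml), handler);
        return handler.titles;
    }

    public static void main(String[] args) throws Exception {
        // Made-up sample feed with an apostrophe entity and an accented character.
        String feed = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>"
                + "<rss><channel><item><title>It&apos;s a caf\u00e9</title></item></channel></rss>";
        System.out.println(parseTitles(feed.getBytes(StandardCharsets.UTF_8)));
    }
}
```

Setting builder back to null in endElement means text outside title elements is ignored, and the null check in characters() avoids appending to a buffer that was never created.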
