Question

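I am trying to write a Java program that reads the NASA RSS feed. The code works, but when it encounters a symbol, it does not read the entire line. For example: "A new NASA study finds the last remaining section of Antarctica&#039;s Larsen B Ice Shelf, which partially collapsed in 2002, is quickly weakening and likely to disintegrate completely before the end of the decade." In the line above, the code does not read the rest of the line after "Antarctica". What is wrong with the code, and how can I fix it? When the &amp;#039;s symbol is not present, the code works fine. Link to the feed: "http://www.nasa.gov/rss/dyn/earth.rss"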

package xmlparseprac;

import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

public class Handler extends DefaultHandler {
    boolean mtitle = false;
    boolean mdescription = false;
    boolean mitem;

    @Override
    public void startDocument() throws SAXException {
        super.startDocument();
        System.out.println("Starting...");
    }

    @Override
    public void endDocument() throws SAXException {
        super.endDocument();
        System.out.println("Ending...");
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atrbts) throws SAXException {
        super.startElement(uri, localName, qName, atrbts);
        if (qName.equalsIgnoreCase("item")) { mitem = true; }
        if (qName.equalsIgnoreCase("title")) { mtitle = true; }
        if (qName.equalsIgnoreCase("description")) { mdescription = true; }
    }

    @Override
    public void endElement(String uri, String localName, String qName) throws SAXException {
        super.endElement(uri, localName, qName);
        if (qName.equalsIgnoreCase("item")) { mitem = false; }
        if (qName.equalsIgnoreCase("title")) { mtitle = false; }
        if (qName.equalsIgnoreCase("description")) { mdescription = false; }
    }

    @Override
    public void characters(char[] ch, int start, int length) throws SAXException {
        super.characters(ch, start, length);
        if (mtitle && mitem) {
            String s = new String(ch, start, length);
            System.out.println("Title:" + s);
            mtitle = false;
        }
        if (mdescription && mitem) {
            String s = new String(ch, start, length);
            System.out.println("Description:" + s);
            mdescription = false;
        }
    }

}

Answer 1

I finally found the answer to my question.

Link: "http://www.javaexperience.com/strip-invalid-characters-from-xml/"
Link: "https://commons.apache.org/proper/commons-lang/javadocs/api-3.1/org/apache/commons/lang3/StringEscapeUtils.html"

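The Apache Commons Lang StringEscapeUtils class contains a method called unescapeHtml4. It replaces HTML-encoded sequences such as &amp;#039; and their equivalents with the characters they stand for. Just read the URL input stream into a string, apply unescapeHtml4 to that string, create an input stream from the result, and call the parse function with that input stream as the argument. Thanks to @duffymo for telling me about the "magic characters".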
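A side note that is not part of the fix above: SAX parsers are allowed to deliver the text of one element in several characters() calls, and they often split at character or entity references such as &#039;. The handler in the question prints only the first chunk and then clears its flag, which would explain the truncated line. Below is a minimal sketch of the usual alternative, collecting chunks in a StringBuilder and emitting the text in endElement. The class name AccumulatingHandler, the parse helper and the sample XML are made up for illustration. If the feed double-escapes the apostrophe (&amp;#039;), the collected text would still contain a literal &#039;, so an unescape step like unescapeHtml4 may still be needed on top of this.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler for illustration; not the code from the question or answer.
class AccumulatingHandler extends DefaultHandler {
    private final StringBuilder text = new StringBuilder();
    private final List<String> lines = new ArrayList<>();
    private boolean inItem;
    private String label; // non-null while inside an item's title or description

    @Override
    public void startElement(String uri, String localName, String qName, Attributes attrs) {
        if (qName.equalsIgnoreCase("item")) {
            inItem = true;
        } else if (inItem && qName.equalsIgnoreCase("title")) {
            label = "Title";
            text.setLength(0); // start a fresh buffer for this element
        } else if (inItem && qName.equalsIgnoreCase("description")) {
            label = "Description";
            text.setLength(0);
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // May be called several times for one element, so append every chunk.
        if (label != null) {
            text.append(ch, start, length);
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (qName.equalsIgnoreCase("item")) {
            inItem = false;
        } else if (label != null && qName.equalsIgnoreCase(label)) {
            // The element is complete only here, so this is where the text is emitted.
            lines.add(label + ":" + text);
            label = null;
        }
    }

    // Parses an XML string and returns the collected "Title:..." / "Description:..." lines.
    static List<String> parse(String xml) {
        AccumulatingHandler handler = new AccumulatingHandler();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("Could not parse feed", e);
        }
        return handler.lines;
    }

    public static void main(String[] args) {
        // Made-up sample input, shaped like one item of an RSS feed.
        String xml = "<rss><channel><item>"
                + "<title>Antarctica&#039;s Larsen B Ice Shelf</title>"
                + "</item></channel></rss>";
        parse(xml).forEach(System.out::println);
    }
}
```

For the real feed, the same handler could be passed to SAXParser.parse together with the input stream opened from the feed URL. The result does not depend on how the parser happens to split the text, because nothing is printed until the closing tag is seen.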

NASA RSS feed gives SAX parsing error
