Extract some nodes from an XML file

Date: 2014-10-07 15:36:29

Tags: java xml

I need to extract some nodes from an XML file that is formatted like this:

<collection sentiment="negativo">
<comment>
    <sentiment> ...</sentiment>
     <chars>...</chars>
    <words>...</words>
    <text>blabla</text>
    <lang>english</lang>
  </comment>

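Now suppose the same XML file contains other <comment> elements that have <lang>spanish</lang>. I need to create two separate XML files. The first should contain ALL THE NODES that have the child <lang>english</lang> (let's call it eng.xml), and the second those that have <lang>spanish</lang> (let's call it spa.xml).

This is my Java code: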

public void getEnglishRows() throws IOException{
    OutputStreamWriter f = new OutputStreamWriter(new FileOutputStream("C:/eclipse/neg_eng.xml"));
    BufferedWriter buff;

    NodeList current_row = doc.getElementsByTagName("comment"); // Collects all the row nodes (which in turn contain child elements) into a list
    NodeList tmp;
    Node nodo = null;

    buff = new BufferedWriter(f);
    for(int i=0;i< current_row.getLength();i++){
        tmp = current_row.item(i).getChildNodes();
        for(int k=0;k<tmp.getLength();k++){
            nodo = tmp.item(k);

            if("english".equals(nodo.getTextContent()))
                System.out.println("IF ENGLISH");
                buff.write(current_row.item(i).getNodeValue());                         
        }
    }


    buff.close();
}

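I don't know if I was clear; I hope so.

So I have an XML file with many <comment></comment> elements. From all of these I need to extract the ones with <lang>english</lang> and write each node (together with its children) to another XML file. The same goes for <lang>spanish</lang>.

The output for eng.xml should be: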

<comment>
    <sentiment> ...</sentiment>
     <chars>...</chars>
    <words>...</words>
    <text>blabla</text>
    <lang>english</lang>
  </comment>

The output for spa.xml should be:

 <comment>
        <sentiment> ...</sentiment>
         <chars>...</chars>
        <words>...</words>
        <text>blabla</text>
        <lang>spanish</lang>
      </comment>

I hope I was clear. My problem is that I can extract the text of all the nodes, but the XML tags are not preserved!!

Please help me!

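1 Answer:

Answer 0 (score: 0)

Why not try deleting the comments that are not in English? My suggestion is to search for the lang tags and detect the non-English ones, then go up to the parent element that contains each such node and remove it. That way the original file structure is preserved.

Try the following code. It worked for me :)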

import java.io.*;
import javax.xml.parsers.*;
import javax.xml.transform.*;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.*;
import org.xml.sax.SAXException;

public void getEnglishRows() throws IOException, SAXException, ParserConfigurationException, TransformerException {
    DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
    DocumentBuilder db = dbf.newDocumentBuilder();
    Document doc;
    try (InputStream in = new FileInputStream("C:/eclipse/neg_eng.xml")) {
        doc = db.parse(in);
    }

    NodeList langs = doc.getElementsByTagName("lang"); // search for the lang elements (live list)

    // Iterate backwards: the list is live, so removing a <comment> also drops its
    // <lang> from the list, and a forward loop would skip the entry that follows.
    for (int i = langs.getLength() - 1; i >= 0; i--) {
        Node langNode = langs.item(i);
        String lang = langNode.getTextContent().trim();

        if (!lang.equalsIgnoreCase("english")) {
            // delete the non-English comment from whatever element contains it
            Node comment = langNode.getParentNode();
            comment.getParentNode().removeChild(comment);
        }
    }
    doc.normalize();

    // write the content into the XML file; the stream is closed automatically
    try (OutputStream out = new FileOutputStream("./eng_sent.xml")) {
        Transformer transformer = TransformerFactory.newInstance().newTransformer();
        transformer.transform(new DOMSource(doc), new StreamResult(out));
    }
}
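
A caveat worth knowing when removing nodes inside such a loop: the NodeList returned by getElementsByTagName is live, so it shrinks as elements are removed, and a loop that counts upwards can step over the entry that slides into the current index. The sketch below illustrates this on a small in-memory document; the class name, the helper methods, and the sample XML are made up for illustration and are not part of the original answer.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class LiveNodeListDemo {
    // Hypothetical sample: two adjacent non-English comments followed by an English one.
    static final String XML =
        "<collection>"
      + "<comment><lang>spanish</lang></comment>"
      + "<comment><lang>spanish</lang></comment>"
      + "<comment><lang>english</lang></comment>"
      + "</collection>";

    static Document parse(String xml) throws Exception {
        return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
    }

    // Removes non-English comments and returns how many <comment> elements remain.
    static int removeNonEnglish(boolean backwards) throws Exception {
        Document doc = parse(XML);
        NodeList langs = doc.getElementsByTagName("lang"); // live list
        if (backwards) {
            for (int i = langs.getLength() - 1; i >= 0; i--) {
                removeIfNotEnglish(langs.item(i));
            }
        } else {
            for (int i = 0; i < langs.getLength(); i++) {
                removeIfNotEnglish(langs.item(i));
            }
        }
        return doc.getElementsByTagName("comment").getLength();
    }

    static void removeIfNotEnglish(Node lang) {
        if (!"english".equalsIgnoreCase(lang.getTextContent().trim())) {
            Node comment = lang.getParentNode();
            comment.getParentNode().removeChild(comment);
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println("forward:   " + removeNonEnglish(false) + " comment(s) left");
        System.out.println("backwards: " + removeNonEnglish(true) + " comment(s) left");
    }
}
```

With the JDK's built-in parser, the forward loop leaves the second Spanish comment in place, while the backward loop leaves only the English one. When non-English comments are never adjacent, both directions happen to give the same result, which is why the problem is easy to miss.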

The output file, eng_sent.xml, will look like this:

<collection sentiment="negativo">
<comment>
    <sentiment> ...</sentiment>
    <chars>...</chars>
    <words>...</words>
    <text>eng3</text>
    <lang>english</lang>
</comment>
<comment>
    <sentiment> ...</sentiment>
    <chars>...</chars>
    <words>...</words>
    <text>eng1</text>
    <lang>english</lang>
</comment>
<comment>
    <sentiment> ...</sentiment>
    <chars>...</chars>
    <words>...</words>
    <text>eng2</text>
    <lang>english</lang>
</comment>
</collection>

The original XML file is:

<collection sentiment="negativo">
<comment>
    <sentiment> ...</sentiment>
    <chars>...</chars>
    <words>...</words>
    <text>eng3</text>
    <lang>english</lang>
</comment>
<comment>
    <sentiment> ...</sentiment>
    <chars>...</chars>
    <words>...</words>
    <text>spa2</text>
    <lang>spanish</lang>
</comment>
<comment>
    <sentiment> ...</sentiment>
    <chars>...</chars>
    <words>...</words>
    <text>eng1</text>
    <lang>english</lang>
</comment>
<comment>
    <sentiment> ...</sentiment>
    <chars>...</chars>
    <words>...</words>
    <text>eng2</text>
    <lang>english</lang>
</comment>
<comment>
    <sentiment> ...</sentiment>
    <chars>...</chars>
    <words>...</words>
    <text>spa1</text>
    <lang>spanish</lang>
</comment>
</collection>
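
If both eng.xml and spa.xml are needed, one possible variation is to leave the original document untouched and copy each <comment> into a separate document per language. The tags survive because whole elements are serialized through a Transformer; getNodeValue() returns null for element nodes, so it cannot be used to write them out. This is only a minimal sketch: the class name LanguageSplitter, the method splitByLanguage, and the inline sample XML are assumptions made for illustration, and the result is returned as strings rather than written to disk.

```java
import java.io.StringReader;
import java.io.StringWriter;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class LanguageSplitter {
    // Groups <comment> elements by their <lang> text; returns language -> serialized XML.
    static Map<String, String> splitByLanguage(String xml) throws Exception {
        DocumentBuilder db = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document source = db.parse(new InputSource(new StringReader(xml)));
        Map<String, Document> byLang = new LinkedHashMap<>();

        // The source document is never modified, so a forward loop is safe here.
        NodeList comments = source.getElementsByTagName("comment");
        for (int i = 0; i < comments.getLength(); i++) {
            Element comment = (Element) comments.item(i);
            NodeList langs = comment.getElementsByTagName("lang");
            if (langs.getLength() == 0) {
                continue; // skip comments that have no <lang> child
            }
            String lang = langs.item(0).getTextContent().trim().toLowerCase(Locale.ROOT);

            Document target = byLang.get(lang);
            if (target == null) {
                target = db.newDocument();
                // Shallow copy of the root keeps its name and attributes, not its children.
                target.appendChild(target.importNode(source.getDocumentElement(), false));
                byLang.put(lang, target);
            }
            // Deep copy of the comment, including all of its child elements.
            target.getDocumentElement().appendChild(target.importNode(comment, true));
        }

        Transformer transformer = TransformerFactory.newInstance().newTransformer();
        transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        Map<String, String> result = new LinkedHashMap<>();
        for (Map.Entry<String, Document> entry : byLang.entrySet()) {
            StringWriter writer = new StringWriter();
            transformer.transform(new DOMSource(entry.getValue()), new StreamResult(writer));
            result.put(entry.getKey(), writer.toString());
        }
        return result;
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical, shortened input in the same shape as the file shown above.
        String xml = "<collection sentiment=\"negativo\">"
            + "<comment><text>eng1</text><lang>english</lang></comment>"
            + "<comment><text>spa1</text><lang>spanish</lang></comment>"
            + "</collection>";
        splitByLanguage(xml).forEach((lang, doc) -> System.out.println(lang + ": " + doc));
    }
}
```

To write files instead of strings, the StringWriter can be swapped for a StreamResult built from a java.io.File, for example one file for the "english" entry and one for the "spanish" entry.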

Hope this helps! Happy hacking ;-)