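Querying an XML file with a special condition

Time: 2016-10-19 11:13:20

Tags: java xml

I have two large files that I collected from Stack Overflow, named posts.xml and questions.txt, with the following structure: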

posts.xml:

<posts>
  <row Id="4" PostTypeId="1" AcceptedAnswerId="7" CreationDate="2008-07-31T21:42:52.667" Score="322" ViewCount="21888" Body="..."/>
  <row Id="6" PostTypeId="1" AcceptedAnswerId="31" CreationDate="2008-07-31T22:08:08.620" Score="140" ViewCount="10912" Body="..." />
  ...
</posts>

A post can be either a question or an answer (the file contains both).

questions.txt:

Id,CreationDate,CreationDatesk,Score
123,2008-08-01 16:08:52,20080801,48
126,2008-08-01 16:10:30,20080801,33
...

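I want to go through posts.xml only once and index the selected rows (those whose Id appears in questions.txt) with Lucene. Because the XML file is very large (about 50 GB), the time spent on querying and indexing matters to me.

Now the question is: how can I find all the rows in posts.xml whose Id also appears in questions.txt?

So far, this is my approach: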

SAXParserDemo.java:

import java.io.File;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;

public class SAXParserDemo {
    public static void main(String[] args) {
        try {
            File inputFile = new File("D:\\University\\Information Retrieval 2\\Hws\\Hw1\\files\\Posts.xml");
            SAXParserFactory factory = SAXParserFactory.newInstance();
            SAXParser saxParser = factory.newSAXParser();
            // Handler is the DefaultHandler subclass defined in Handler.java
            Handler userhandler = new Handler();
            saxParser.parse(inputFile, userhandler);
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}

Handler.java:

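import java.io.BufferedReader;
import java.io.FileNotFoundException;
import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayList;
import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

public class Handler extends DefaultHandler {

    public void getQuestiondId() {
        ArrayList<String> qIDs = new ArrayList<String>();
        // try-with-resources closes the reader even if reading fails
        try (BufferedReader br = new BufferedReader(
                new FileReader("D:\\University\\Information Retrieval 2\\Hws\\Hw1\\files\\Q.txt"))) {
            String qId;
            while ((qId = br.readLine()) != null) {
                qId = qId.split(",")[0];  //this is question id
                findAndIndexOnPost(qId);    //find this id on posts.xml then index it!
            }
        } catch (FileNotFoundException e) {
            e.printStackTrace();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }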

    private void findAndIndexOnPost(String qID) {

    }

    @Override
    public void startElement(String uri,
                             String localName, String qName, Attributes attributes)
            throws SAXException {
        if (qName.equalsIgnoreCase("row")) {
            System.out.println(attributes.getValue("Id"));
            switch (attributes.getValue("PostTypeId")) {
                case "1":
                    String id = attributes.getValue("Id");
                    break;
                case "2":
                    break;
                default:
                    break;
            }

        }
    }
}

Update

I need to keep a pointer into the XML file across iterations, but I don't know how to do that with SAX.

1 answer:

Answer 0 (score: 1)

What you have to do is:

  • Read the TXT file (a simple stream will probably do).
  • Add all Id values to a List<Integer> questionIds, one by one. You will have to parse them manually (with a regex or String.indexOf()). With this many lookups, a HashSet<Integer> is likely a better fit than a List, because contains() on a list scans every element.
  • In your Handler implementation, simply check whether questionIds.contains(givenId).
  • Send the matching row (from the XML) to Elasticsearch with a simple REST request (POST/PUT).

Ta-da! Your data is now indexed with Lucene, which Elasticsearch uses under the hood.
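
The first three steps above can be sketched roughly as follows. This is only a minimal illustration, not a definitive implementation: the class name QuestionRowFilter, the loadIds helper, and the onMatch callback are hypothetical names, the sample data in main is made up, and the indexing step is left to the callback. It assumes the first comma-separated field of each line in questions.txt is the numeric Id, and it keeps the PostTypeId check from the question's handler.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Consumer;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical helper: streams the XML once and hands matching question rows to a callback. */
class QuestionRowFilter extends DefaultHandler {
    private final Set<Integer> questionIds;
    private final Consumer<Attributes> onMatch; // e.g. code that builds and indexes a document

    QuestionRowFilter(Set<Integer> questionIds, Consumer<Attributes> onMatch) {
        this.questionIds = questionIds;
        this.onMatch = onMatch;
    }

    /** Collects the first field of every line that is purely numeric (skips the header and blank lines). */
    static Set<Integer> loadIds(Reader source) throws IOException {
        Set<Integer> ids = new HashSet<>();
        try (BufferedReader br = new BufferedReader(source)) {
            String line;
            while ((line = br.readLine()) != null) {
                String first = line.split(",", 2)[0].trim();
                if (first.matches("\\d+")) {
                    ids.add(Integer.parseInt(first));
                }
            }
        }
        return ids;
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes attrs) {
        if (!"row".equals(qName)) return;
        if (!"1".equals(attrs.getValue("PostTypeId"))) return; // questions only
        String id = attrs.getValue("Id");
        if (id != null && questionIds.contains(Integer.valueOf(id))) {
            // The Attributes object is only valid during this call,
            // so the callback must copy whatever values it needs.
            onMatch.accept(attrs);
        }
    }

    public static void main(String[] args) throws Exception {
        // Made-up sample data in the same shape as the two files.
        String txt = "Id,CreationDate,CreationDatesk,Score\n4,2008-07-31 21:42:52,20080731,322\n";
        String xml = "<posts>"
                + "<row Id=\"4\" PostTypeId=\"1\" Score=\"322\"/>"
                + "<row Id=\"6\" PostTypeId=\"1\" Score=\"140\"/>"
                + "</posts>";

        Set<Integer> ids = loadIds(new StringReader(txt));
        List<String> matched = new ArrayList<>();
        QuestionRowFilter handler = new QuestionRowFilter(ids, a -> matched.add(a.getValue("Id")));
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)), handler);
        System.out.println(matched); // prints [4]
    }
}
```

Because the Id set lives in memory and the XML is read strictly front to back, the large file is only traversed once; for the real files, the StringReader and ByteArrayInputStream would be replaced by a FileReader and a FileInputStream.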

Also, change the way you pass data to the SAX parser. Instead of giving it a File, create an InputStream implementation that you can pass to saxParser.parse(inputStream, userhandler). For getting the position in a stream, see: "Given a Java InputStream, how can I determine the current offset in the stream?"
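
One way to build such a stream is a small FilterInputStream subclass that counts the bytes passing through it. The sketch below is an assumption about what that could look like, not the code behind the linked question; CountingInputStream and getCount are hypothetical names defined here. Keep in mind that the parser reads ahead in blocks, so the count is the position the parser has read up to, which can be well past the element currently being reported.

```java
import java.io.ByteArrayInputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical wrapper that counts how many bytes have been consumed through it. */
class CountingInputStream extends FilterInputStream {
    private long count;

    CountingInputStream(InputStream in) {
        super(in);
    }

    @Override
    public int read() throws IOException {
        int b = super.read();
        if (b >= 0) count++;          // -1 means end of stream, nothing consumed
        return b;
    }

    @Override
    public int read(byte[] buf, int off, int len) throws IOException {
        int n = super.read(buf, off, len);
        if (n > 0) count += n;
        return n;
    }

    @Override
    public long skip(long n) throws IOException {
        long skipped = super.skip(n);
        count += skipped;
        return skipped;
    }

    @Override
    public boolean markSupported() {
        return false;                 // mark/reset would make the count misleading
    }

    /** Number of bytes read or skipped so far. */
    long getCount() {
        return count;
    }

    public static void main(String[] args) throws Exception {
        byte[] xml = "<posts><row Id=\"4\"/><row Id=\"6\"/></posts>".getBytes(StandardCharsets.UTF_8);
        CountingInputStream in = new CountingInputStream(new ByteArrayInputStream(xml));
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName, String qName, Attributes attrs) {
                // Reports how far the parser has read, not the exact start of this element.
                System.out.println(qName + " reported after " + in.getCount() + " bytes were read");
            }
        };
        SAXParserFactory.newInstance().newSAXParser().parse(in, handler);
    }
}
```

If an approximate position is not enough, SAX can also report the line and column of the current event through the Locator passed to setDocumentLocator, which may be a better fit for pinpointing individual rows.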