SAX parser for a very large XML file

Asked: 2011-04-16 03:10:17

Tags: java sax xml-parsing

I am processing a very large XML file (4 GB) and I keep getting out-of-memory errors, even though my Java heap is already set to its maximum. Here is the code:

Handler h1 = new Handler("post");
Handler h2 = new Handler("comment");
posts = new Hashtable<Integer, Posts>();
comments = new Hashtable<Integer, Comments>();
edges = new Hashtable<String, Edges>();
try {
    output = new BufferedWriter(new FileWriter("gephi.gdf"));
    SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
    SAXParser parser1 = SAXParserFactory.newInstance().newSAXParser();

    parser.parse(new File("G:\\posts.xml"), h1);
    parser1.parse(new File("G:\\comments.xml"), h2);
} catch (Exception ex) {
    ex.printStackTrace();
}

@Override
public void startElement(String uri, String localName, String qName,
        Attributes atts) throws SAXException {
    if (qName.equalsIgnoreCase("row") && type.equals("post")) {
        post = new Posts();
        post.id = Integer.parseInt(atts.getValue("Id"));
        post.postTypeId = Integer.parseInt(atts.getValue("PostTypeId"));
        if (atts.getValue("AcceptedAnswerId") != null)
            post.acceptedAnswerId = Integer.parseInt(atts.getValue("AcceptedAnswerId"));
        else
            post.acceptedAnswerId = -1;
        post.score = Integer.parseInt(atts.getValue("Score"));
        if (atts.getValue("OwnerUserId") != null)
            post.ownerUserId = Integer.parseInt(atts.getValue("OwnerUserId"));
        else
            post.ownerUserId = -1;
        if (atts.getValue("ParentId") != null)
            post.parentId = Integer.parseInt(atts.getValue("ParentId"));
        else
            post.parentId = -1;
    }
    else if (qName.equalsIgnoreCase("row") && type.equals("comment")) {
        comment = new Comments();
        comment.id = Integer.parseInt(atts.getValue("Id"));
        comment.postId = Integer.parseInt(atts.getValue("PostId"));
        if (atts.getValue("Score") != null)
            comment.score = Integer.parseInt(atts.getValue("Score"));
        else
            comment.score = -1;
        if (atts.getValue("UserId") != null)
            comment.userId = Integer.parseInt(atts.getValue("UserId"));
        else
            comment.userId = -1;
    }
}



@Override
public void endElement(String uri, String localName, String qName)
        throws SAXException {
    if (qName.equalsIgnoreCase("row") && type.equals("post")) {
        // Every parsed post stays in memory for the rest of the run.
        posts.put(post.id, post);
        //System.out.println("Size of hash table is " + posts.size());
    } else if (qName.equalsIgnoreCase("row") && type.equals("comment")) {
        // Likewise for every parsed comment.
        comments.put(comment.id, comment);
    }
}

Is there a way to optimize this code so that I don't run out of memory? Perhaps by using streams? If so, how would you do it?

2 answers:

Answer 0 (score: 3)

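The SAX parser is efficient to a fault: it streams the document and never holds the whole file in memory, so the parser itself is unlikely to be the problem.

The posts, comments, and edges maps immediately jump out as potential problems, since every parsed row is stored in them until the end of the run. I suspect you will need to periodically flush these maps out of memory to avoid an OOME (OutOfMemoryError).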
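As a rough illustration of the flushing idea, here is a minimal sketch of a handler that buffers at most a fixed number of rows, writes them out, and then clears the buffer. The class name `FlushingRowHandler`, the `batchSize` parameter, and the comma-separated output line are assumptions made for this example; they are not taken from the question's code or from the GDF format.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.Writer;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: holds at most batchSize rows in memory at any time.
class FlushingRowHandler extends DefaultHandler {
    private final Writer out;
    private final int batchSize;
    private final List<String> batch = new ArrayList<>();

    FlushingRowHandler(Writer out, int batchSize) {
        this.out = out;
        this.batchSize = batchSize;
    }

    @Override
    public void startElement(String uri, String localName, String qName,
            Attributes atts) throws SAXException {
        if (!qName.equalsIgnoreCase("row")) {
            return;
        }
        // Keep only the fields needed later (assumed: Id and OwnerUserId).
        String owner = atts.getValue("OwnerUserId");
        batch.add(atts.getValue("Id") + "," + (owner != null ? owner : "-1"));
        if (batch.size() >= batchSize) {
            flush();
        }
    }

    @Override
    public void endDocument() throws SAXException {
        // Write whatever is left in the final, partially filled batch.
        flush();
    }

    private void flush() throws SAXException {
        try {
            for (String line : batch) {
                out.write(line);
                out.write('\n');
            }
            out.flush();
        } catch (IOException e) {
            // SAX callbacks can only throw SAXException, so wrap the I/O error.
            throw new SAXException(e);
        }
        batch.clear(); // release the buffered rows so they can be garbage collected
    }

    // Convenience entry point: parse the stream and write one line per row.
    static void parse(InputStream in, Writer out, int batchSize) throws Exception {
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        parser.parse(in, new FlushingRowHandler(out, batchSize));
    }
}
```

Something like `FlushingRowHandler.parse(new FileInputStream("G:\\posts.xml"), output, 10000)` would then keep no more than 10000 rows in memory at once. This only helps if each row can be written independently. If posts and comments have to be joined to build the edges, some lookup data must stay in memory, and it is probably worth keeping only the few integer fields the join actually needs.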

Answer 1 (score: 0)

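Take a look at a project called SaxDoMix: http://www.devsphere.com/xml/saxdomix/

It lets you parse a large XML file and have certain elements returned to you as parsed DOM entities. That is much easier to work with than a pure SAX parser.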