Loading (all of) Wikipedia into MongoDB?

Date: 2013-06-24 22:25:53

Tags: python xml mongodb wikipedia elementtree

At the MongoNYC 2013 conference, a speaker mentioned that they had used a copy of Wikipedia to test their full-text search. I tried to replicate this myself and found it nontrivial because of the file's size and format.

Here is what I am doing:

$ wget http://download.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2
$ bunzip2 enwiki-latest-pages-articles.xml.bz2 
$ python
>>> import xml.etree.ElementTree as ET
>>> tree = ET.parse('enwiki-latest-pages-articles.xml')
Killed

Python gets killed when I try to parse with the standard XML parser, apparently because it tries to build the entire 9GB document in memory. Does anyone have any other suggestions for how to convert a 9GB XML file into something JSON-y that I can load into MongoDB?

Update 1

Following Sean's suggestion below, I also tried the iterative element tree:

>>> import xml.etree.ElementTree as ET
>>> context = ET.iterparse('enwiki-latest-pages-articles.xml', events=("start", "end"))
>>> context = iter(context)
>>> event, root = context.next()
>>> for i in context[0:10]:
...     print(i)
... 
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: '_IterParseIterator' object has no attribute '__getitem__'
>>> for event, elem in context[0:10]:
...     if event == "end" and elem.tag == "record":
...             print(elem)
...             root.clear()
... 
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: '_IterParseIterator' object has no attribute '__getitem__'

Again, no luck.
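For what it's worth, both tracebacks above come from slicing: `iterparse` returns an iterator, which has no `__getitem__`, so `context[0:10]` fails. A minimal sketch of the fix (using a tiny inline document in place of the dump; the element names here are illustrative, not the dump's schema) is to take the first N events with `itertools.islice`:

```python
import io
import itertools
import xml.etree.ElementTree as ET

# A tiny stand-in for enwiki-latest-pages-articles.xml
sample = io.BytesIO(
    b"<mediawiki><page><title>A</title></page>"
    b"<page><title>B</title></page></mediawiki>"
)

context = ET.iterparse(sample, events=("start", "end"))
event, root = next(context)  # first event is the start of the root element

titles = []
# islice, not context[0:10]: iterators cannot be sliced directly
for event, elem in itertools.islice(context, 10):
    if event == "end" and elem.tag == "page":
        titles.append(elem.findtext("title"))
        root.clear()  # drop pages we have already processed

print(titles)
```

The `root.clear()` call is what keeps memory flat: each fully parsed `page` is removed from the tree as soon as it has been handled.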

Update 2

Following up on Asya Kamsky's suggestions below.

Here is the attempt with xml2json:

$ git clone https://github.com/hay/xml2json.git
$ ./xml2json/xml2json.py -t xml2json -o enwiki-latest-pages-articles.json enwiki-latest-pages-articles.xml
Traceback (most recent call last):
  File "./xml2json/xml2json.py", line 199, in <module>
    main()
  File "./xml2json/xml2json.py", line 181, in main
    input = open(arguments[0]).read()
MemoryError

And here is xmlutils:

$ pip install xmlutils
$ xml2json --input "enwiki-latest-pages-articles.xml" --output "enwiki-latest-pages-articles.json"
xml2sql by Kailash Nadh (http://nadh.in)
    --help for help


Wrote to enwiki-latest-pages-articles.json

But the output contained only a single record. It didn't iterate.

xmltodict also looked promising, since it advertises iterative Expat-based parsing and being good for Wikipedia. But after 20 minutes or so it too ran out of memory:

>>> import xmltodict
>>> f = open('enwiki-latest-pages-articles.xml')
>>> doc = xmltodict.parse(f)
Killed
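A hedged note on the xmltodict attempt: calling `parse(f)` with no other arguments builds the whole document as one dict, which is what exhausts memory. The library's documented streaming mode passes `item_depth` and `item_callback`, so each matching subtree is handed to the callback and then discarded. A minimal sketch, with a tiny inline document standing in for the dump (and assuming xmltodict is installed):

```python
import io
import xmltodict

# A tiny stand-in for the 9GB dump
sample = io.BytesIO(
    b"<mediawiki><page><title>A</title></page>"
    b"<page><title>B</title></page></mediawiki>"
)

titles = []

def handle_page(path, page):
    # page is the dict for one <page> element at depth 2
    titles.append(page["title"])
    return True  # keep streaming; returning False aborts the parse

# item_depth=2 streams each <page> (root is depth 1) through the callback
xmltodict.parse(sample, item_depth=2, item_callback=handle_page)
print(titles)
```

In this mode `parse` returns nothing useful; all the work happens in the callback, which is where a MongoDB insert would go.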

Update 3

This is in response to Ross's answer, modeling my parser on the link he mentions:

from lxml import etree

file = 'enwiki-latest-pages-articles.xml'

def page_handler(page):
    try:
        print page.get('title','').encode('utf-8')
    except:
        print page
        print "error"

class page_handler(object):
    def __init__(self):
        self.text = []
    def start(self, tag, attrib):
        self.is_title = True if tag == 'title' else False
    def end(self, tag):
        pass
    def data(self, data):
        if self.is_title:
            self.text.append(data.encode('utf-8'))
    def close(self):
        return self.text

def fast_iter(context, func):
    for event, elem in context:
        print(elem)
        elem.clear()
        while elem.getprevious() is not None:
            del elem.getparent()[0]
    del context

process_element = etree.XMLParser(target = page_handler())

context = etree.iterparse( file, tag='item' )
fast_iter(context,process_element)

The error was:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<stdin>", line 2, in fast_iter
  File "iterparse.pxi", line 484, in lxml.etree.iterparse.__next__ (src/lxml/lxml.etree.c:112653)
  File "iterparse.pxi", line 537, in lxml.etree.iterparse._read_more_events (src/lxml/lxml.etree.c:113223)
  File "parser.pxi", line 596, in lxml.etree._raiseParseError (src/lxml/lxml.etree.c:83186)
lxml.etree.XMLSyntaxError: Extra content at the end of the document, line 22, column 1
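One likely problem with the attempt above, sketched here as an assumption rather than a confirmed fix: a Wikipedia dump has no `<item>` elements, and its `<page>` elements live in a namespace, so the `tag=` filter must be fully qualified (the namespace URI below is illustrative; check the `<mediawiki>` root element of your dump for the real one). A tiny inline document stands in for the dump:

```python
import io
from lxml import etree

# Assumed namespace; the version suffix varies between dump releases
NS = "http://www.mediawiki.org/xml/export-0.10/"

sample = io.BytesIO((
    '<mediawiki xmlns="%s">'
    "<page><title>A</title></page>"
    "<page><title>B</title></page>"
    "</mediawiki>" % NS
).encode())

titles = []
# The tag filter must include the namespace, e.g. "{...}page", not "item"
for event, elem in etree.iterparse(sample, tag="{%s}page" % NS):
    titles.append(elem.findtext("{%s}title" % NS))
    elem.clear()
    while elem.getprevious() is not None:
        del elem.getparent()[0]  # drop preceding siblings to cap memory

print(titles)
```

The "Extra content at the end of the document" error can also indicate a truncated or concatenated file, so it is worth verifying the download completed before blaming the parser.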

2 Answers:

Answer 0 (score: 1)

You need to iterate with iterparse instead of loading the whole file into memory. As for how to convert it to JSON, or even to a Python object to store in the db, see: https://github.com/knadh/xmlutils.py/blob/master/xmlutils/xml2json.py

Update

An example of using iterparse while keeping the memory footprint low:

Try a variation of Liza Daly's fast_iter. After it has processed an element elem, it calls elem.clear() to remove descendants, and also removes preceding siblings.

from lxml import etree

def fast_iter(context, func):
    # http://www.ibm.com/developerworks/xml/library/x-hiperfparse/
    # Author: Liza Daly
    for event, elem in context:
        func(elem)  # hand each element to the processing callback
        elem.clear()
        while elem.getprevious() is not None:
            del elem.getparent()[0]
    del context

context = etree.iterparse( MYFILE, tag='item' )
fast_iter(context,process_element)

Daly's article is excellent, especially for processing large XML files.
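Pulling the pieces of this answer together, here is a minimal end-to-end sketch using only the standard library. The field names (`title`, `text`) are assumptions about the dump schema, not verified against it, and the actual MongoDB insert is shown only as a comment since it needs a running server; a tiny inline document stands in for the dump:

```python
import io
import xml.etree.ElementTree as ET

# A tiny stand-in for the 9GB dump
sample = io.BytesIO(
    b"<mediawiki>"
    b"<page><title>A</title><text>alpha</text></page>"
    b"<page><title>B</title><text>beta</text></page>"
    b"</mediawiki>"
)

docs = []
for event, elem in ET.iterparse(sample, events=("end",)):
    if elem.tag == "page":
        # One JSON-y dict per page; field names are illustrative
        doc = {"title": elem.findtext("title"), "text": elem.findtext("text")}
        # With pymongo: MongoClient().wiki.pages.insert_one(doc)
        docs.append(doc)
        elem.clear()  # drop children we no longer need

print(docs)
```

Each dict is built and released one page at a time, so memory stays bounded no matter how large the dump is.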

Answer 1 (score: 1)

Just in case someone stumbles on this question in 2018.

There is now a one-line command available for this (Node.js):

https://github.com/spencermountain/dumpster-dive
