How can I parse/extract specific values from a huge input file in Python?

Asked: 2014-10-15 20:34:19

Tags: python regex parsing

I have the following huge input file (from the stackexchange dataset):

 <row Id="659890" PostTypeId="2" ParentId="655986" CreationDate="2009-03-18T20:06:33.720" />
 <row Id="659891" PostTypeId="2" ParentId="659089" CreationDate="2009-03-18T20:07:44.843" /> 

Normally, I process a file by reading it line by line:

with open("file.txt", "r") as f:
    for line in f:
        print(line)

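In this case, however, I would like to process it post by post. How can I do that?

I would also like to extract the value of PostTypeId and save it in a variable (and do the same for the other values).

So my question is: assuming the dataset is really huge, what is the most efficient way to do this?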

2 answers:

Answer 0: (score: 1)

You can use xml.etree.ElementTree:

import xml.etree.ElementTree as ET
# 'source' is the path to (or an open file object for) a well-formed XML file
tree = ET.parse(source)
root = tree.getroot()
# Look at each element that has the 'row' tag
for row in root.iter('row'):
    print(row.get('PostTypeId'))

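Edit: to avoid the "junk after document element" error, which is raised because the file has no single root element, wrap the contents in a root tag: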

# Note: data.read() loads the whole file into memory
with open(someFile, 'r') as data:
    xmlData = '<rows>' + data.read() + '</rows>'
rows = ET.fromstring(xmlData)
for row in rows:
    print(row.get('PostTypeId'))
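Because `data.read()` holds the entire file in memory, the wrapping trick may not scale to a really large dataset. The same idea can be sketched in a streaming form with `ET.XMLPullParser` (Python 3.4+): feed a synthetic `<rows>` root first, then feed the file one line at a time and handle each `row` as it completes. The function name `iter_rows` and the sample lines are hypothetical, chosen only for illustration.

```python
import xml.etree.ElementTree as ET


def iter_rows(lines):
    """Yield the attribute dict of each <row/> from an iterable of text lines."""
    parser = ET.XMLPullParser(events=("start", "end"))
    root = None

    def drain():
        # Consume whatever events the parser has produced so far.
        nonlocal root
        for event, elem in parser.read_events():
            if event == "start" and root is None:
                root = elem  # the first start event is the synthetic root
            elif event == "end" and elem.tag == "row":
                yield dict(elem.attrib)
                root.clear()  # drop processed rows so the tree does not grow

    parser.feed("<rows>")  # synthetic root, so the input is one document
    for line in lines:
        parser.feed(line)
        yield from drain()
    parser.feed("</rows>")
    parser.close()
    yield from drain()  # pick up any events the parser reported late


# Hypothetical sample input; with a real file, pass the open file object.
sample = [
    '<row Id="659890" PostTypeId="2" ParentId="655986" />\n',
    '<row Id="659891" PostTypeId="2" ParentId="659089" />\n',
]
rows = list(iter_rows(sample))
```

With a real file this would be called as `for attrs in iter_rows(open(someFile))`, so only a small window of the input is in memory at any time.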

Answer 1: (score: 1)

If you are sure that each <tag /> is on its own line, and memory is a concern, this may work for you:

from xml.etree import ElementTree as ET

with open('yourfile', 'r') as f:
    # a file object is already an iterator over its lines
    for line in f:
        if not line.strip():
            continue  # skip blank lines, which fromstring cannot parse
        # use fromstring so you don't even need to wrap with another tag
        tree = ET.fromstring(line)
        # attrib holds all the attributes in a dict {key: value}
        # you may store this dict, append it to a list, or write it to a file or database
        print(tree.attrib)

The result for your sample (key order may differ between Python versions):

{'PostTypeId': '2', 'CreationDate': '2009-03-18T20:06:33.720', 'Id': '659890', 'ParentId': '655986'}
{'PostTypeId': '2', 'CreationDate': '2009-03-18T20:07:44.843', 'Id': '659891', 'ParentId': '659089'}
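To keep the value of PostTypeId in a variable rather than printing the whole dict, the per-line approach can be wrapped in a small generator. This is only a minimal sketch that assumes one `<row />` per line, as above; the name `post_type_ids` and the tally with `collections.Counter` are illustrative choices, not part of the original answer.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def post_type_ids(lines):
    """Yield the PostTypeId of every <row/> line, skipping blank lines."""
    for line in lines:
        if not line.strip():
            continue  # fromstring raises ParseError on empty input
        row = ET.fromstring(line)
        post_type_id = row.get("PostTypeId")  # None if the attribute is absent
        if post_type_id is not None:
            yield post_type_id


# The two sample rows from the question, plus a blank line.
sample = [
    '<row Id="659890" PostTypeId="2" ParentId="655986" CreationDate="2009-03-18T20:06:33.720" />\n',
    '\n',
    '<row Id="659891" PostTypeId="2" ParentId="659089" CreationDate="2009-03-18T20:07:44.843" />\n',
]
counts = Counter(post_type_ids(sample))
```

For the real file, pass the open file object instead of the list, for example `with open('yourfile') as f: counts = Counter(post_type_ids(f))`. The other attributes can be read the same way with `row.get('ParentId')` and so on.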