为什么XML解析如此困难?

时间:2016-07-14 11:33:39

标签: python xml python-3.x

我正在尝试解析从EPO-OPS收到的这份简单文件。

<?xml version="1.0" encoding="UTF-8"?><?xml-stylesheet type="text/xsl" href="/3.0/style/exchange.xsl"?>
<ops:world-patent-data xmlns="http://www.epo.org/exchange" xmlns:ops="http://ops.epo.org" xmlns:xlink="http://www.w3.org/1999/xlink">
    <ops:meta name="elapsed-time" value="2"/>
    <exchange-documents>
        <exchange-document system="ops.epo.org" family-id="19768124" country="EP" doc-number="1000000" kind="A1">
            <abstract lang="en">
                <p>The invention relates to an apparatus (1) for manufacturing green bricks from clay for the brick manufacturing industry, comprising a circulating conveyor (3) carrying mould containers combined to mould container parts (4), a reservoir (5) for clay arranged above the mould containers, means for carrying clay out of the reservoir (5) into the mould containers, means (9) for pressing and trimming clay in the mould containers, means (11) for supplying and placing take-off plates for the green bricks (13) and means for discharging green bricks released from the mould containers, characterized in that the apparatus further comprises means (22) for moving the mould container parts (4) filled with green bricks such that a protruding edge is formed on at least one side of the green bricks. &lt;IMAGE></p>
            </abstract>
        </exchange-document>
    </exchange-documents>
</ops:world-patent-data>

我在做

import xml.etree.ElementTree as ET
root = ET.parse('pyre.xml').getroot()
for child in root:
    for kid in child:
        for abst in kid:
            for p in abst:
                print (p.text)

有没有类似于json的简单方法:

print (root.exchange-documents.exchange-document.abstract.p.text)

2 个答案:

答案 0 :(得分:2)

使用BeautifulSoup会容易得多。试试这个:

from bs4 import BeautifulSoup

xml = """<?xml version="1.0" encoding="UTF-8"?><?xml-stylesheet type="text/xsl" href="/3.0/style/exchange.xsl"?>
<ops:world-patent-data xmlns="http://www.epo.org/exchange" xmlns:ops="http://ops.epo.org" xmlns:xlink="http://www.w3.org/1999/xlink">
    <ops:meta name="elapsed-time" value="2"/>
    <exchange-documents>
        <exchange-document system="ops.epo.org" family-id="19768124" country="EP" doc-number="1000000" kind="A1">
            <abstract lang="en">
                <p>The invention relates to an apparatus (1) for manufacturing green bricks from clay for the brick manufacturing industry, comprising a circulating conveyor (3) carrying mould containers combined to mould container parts (4), a reservoir (5) for clay arranged above the mould containers, means for carrying clay out of the reservoir (5) into the mould containers, means (9) for pressing and trimming clay in the mould containers, means (11) for supplying and placing take-off plates for the green bricks (13) and means for discharging green bricks released from the mould containers, characterized in that the apparatus further comprises means (22) for moving the mould container parts (4) filled with green bricks such that a protruding edge is formed on at least one side of the green bricks. &lt;IMAGE></p>
            </abstract>
        </exchange-document>
    </exchange-documents>
</ops:world-patent-data>"""

“长”解决方案:

soup = BeautifulSoup(xml)
for sub_cell_tag in soup.find_all('abstract'):
    print(sub_cell_tag.text)

如果你进入一个衬垫:

print('\n'.join([i.text for i in BeautifulSoup(xml).find_all('abstract')]))

答案 1 :(得分:2)

您可以在ElementTree中使用XPath表达式。请注意,因为您使用xmlns定义了全局XML命名空间,所以需要指定该URL:

tree = ElementTree.parse(…)

namespaces = { 'ns': 'http://www.epo.org/exchange' }
paragraphs = tree.findall('.//ns:abstract/ns:p', namespaces)
for paragraph in paragraphs:
     print(paragraph.text)