关于正则表达式和XML

时间:2011-10-20 15:24:29

标签: python xml regex

我有XML格式的数据。示例如下所示。我想从<text> tag中提取数据。 这是我的XML数据。

    <text>
    The 40-Year-Old Virgin is a 2005 American buddy comedy
    film about a middle-aged man's journey to finally have sex.

    <h1>The plot</h1>
    Andy Stitzer (Steve Carell) is the eponymous 40-year-old virgin.
    <h1>Cast</h1>

    <h1>Soundtrack</h1>

    <h1>External Links</h1>
</text>

我只需要The 40-Year-Old Virgin is a 2005 American buddy comedy film about a middle-aged man's journey to finally have sex.这可能吗?感谢

5 个答案:

答案 0 :(得分:4)

使用XML解析器解析XML。使用lxml

import lxml.etree as ET

content='''\
<text>
    The 40-Year-Old Virgin is a 2005 American buddy comedy
    film about a middle-aged man's journey to finally have sex.

    <h1>The plot</h1>
    Andy Stitzer (Steve Carell) is the eponymous 40-year-old virgin.
    <h1>Cast</h1>

    <h1>Soundtrack</h1>

    <h1>External Links</h1>
</text>
'''

text=ET.fromstring(content)
print(text.text)

产量

    The 40-Year-Old Virgin is a 2005 American buddy comedy
    film about a middle-aged man's journey to finally have sex.

答案 1 :(得分:2)

不要使用正则表达式来解析XML / HTML。在python中使用适当的解析器,如BeautifulSoup或lxml。

答案 2 :(得分:2)

每当您发现自己正在查看XML数据并考虑正则表达式时,您应该停下来问问自己为什么不考虑使用真正的XML解析器。 XML的结构使其非常适合于正确的解析器,并使正则表达式令人沮丧。

如果必须使用正则表达式,则应执行以下操作: 直到您的文档发生变化

import re
p = re.compile("<text>(.*)<h1>")
p.search(xml_text).group(1)

Spoiler:正则表达式可能是合适的,如果这只是一个需要快速而肮脏的解决方案的一次性问题。或者,如果您知道输入数据是相当静态的并且不能容忍解析器的开销,那么它们可能是合适的。

答案 3 :(得分:2)

以下是使用ElementTree

执行此操作的方法
In [18]: import xml.etree.ElementTree as et

In [19]: t = et.parse('f.xml')

In [20]: print t.getroot().text.strip()
The 40-Year-Old Virgin is a 2005 American buddy comedy
    film about a middle-aged man's journey to finally have sex.

答案 4 :(得分:1)

以下是使用xml.etree.ElementTree的示例:

>>> import xml.etree.ElementTree as ET
>>> data = """<text>
...     The 40-Year-Old Virgin is a 2005 American buddy comedy
...     film about a middle-aged man's journey to finally have sex.
... 
...     <h1>The plot</h1>
...     Andy Stitzer (Steve Carell) is the eponymous 40-year-old virgin.
...     <h1>Cast</h1>
... 
...     <h1>Soundtrack</h1>
... 
...     <h1>External Links</h1>
... </text>"""
>>> xml = ET.XML(data)
>>> xml.text
"\n    The 40-Year-Old Virgin is a 2005 American buddy comedy\n    film about a middle-aged man's journey to finally have sex.\n\n    "
>>> xml.text.strip().replace('\n   ', '')
"The 40-Year-Old Virgin is a 2005 American buddy comedy film about a middle-aged man's journey to finally have sex."

你去了!