lxml: check whether text exists after an element (not just the tail)

Time: 2016-04-18 14:14:34

Tags: python text lxml

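I am parsing XML documents that contain a specific tag (<question>) nested under another tag (<Turn>), and I need to check whether there is any text after the closing tag </question> up to the closing parent tag </Turn>. The problem is that between </question> and </Turn> there may be other tags, line breaks, spaces, or all of these at once, so retrieving only the tail of the question is not enough.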

Here is a sample of the kind of XML file I am working with:

<root>
<Turn speaker="spk2" startTime="5121.203" endTime="5136.265">
<question startline="8321" endline="8326">
<Sync time="5121.203"/>
some text
<Sync time="5126.531"/>
<Sync time="5127.662"/>
other text?</question><question startline="8326" endline="8326">
here are some other words?
</question>
<Sync time="5128.514"/>
THIS IS SOME TEXT I WANT TO GET <anothertag att="2"/> SOME OTHER TEXT
<annoyingtag att="blah"/>
AND THIS TOO
</Turn>
<Turn>
<question> this is a question? </question> this is not, I want to get this text.
</Turn>
<Turn>
There could be a turn with no question here.
</Turn>
<Turn>
<question> and then another with a question? </question> followed by <Sync/> other text but also <Event/> other tags <Who/> and I want to get all this text.
</Turn>
</root>

I am processing the XML in Python with lxml. By the time I want to check whether there is some text between </question> and </Turn>, I am already looping over the questions with a for loop, like this:

Turns = rootnode.findall(".//Turn")
for Turn in Turns:
    questions = Turn.findall(".//question")
    for question in questions:
        if question == questions[-1]:
            # This is where I will insert the code trying to find
            # if there is some text following the question tag.
            pass

At this point I tried question.xpath("//text()")[1], and as another approach question.tail to get the tail, but in both cases I cannot see all the text between the last </question> and </Turn> (I get either none of it or only part of it).

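I also tried doing this on the raw file with regular expressions, but because so many things can appear between the two closing tags, I ended up with regexes that had nested quantifiers and ran into catastrophic backtracking.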

1 answer:

Answer 0: (score: 0)

If the Sync tag is always present, this might work:

xml = """<Turn speaker="spk2" startTime="5121.203" endTime="5136.265">
<question startline="8321" endline="8326">
<Sync time="5121.203"/>
some text
<Sync time="5126.531"/>
<Sync time="5127.662"/>
other text?</question><question startline="8326" endline="8326">
here are some other words?
</question>
<Sync time="5128.514"/>
THIS IS SOME TEXT I WANT TO GET <anothertag att="2"/> SOME OTHER TEXT
<annoyingtag att="blah"/>
AND THIS TOO
</Turn>"""

from lxml.html import fromstring

xml = fromstring(xml)

print(xml.xpath("//question[last()]/following::sync/following::text()"))

Which gives you:

['\nTHIS IS SOME TEXT I WANT TO GET ', ' SOME OTHER TEXT\n', '\nAND THIS TOO\n']

Or:

print(xml.xpath("//question[last()]/following::text()"))

This gives you:

['\n', '\nTHIS IS SOME TEXT I WANT TO GET ', ' SOME OTHER TEXT\n', '\nAND THIS TOO\n']

You can also use a wildcard:

print(xml.xpath("//question[last()]/following::*/following::text()"))

Which again gives you:

['\nTHIS IS SOME TEXT I WANT TO GET ', ' SOME OTHER TEXT\n', '\nAND THIS TOO\n']
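
For the per-Turn check described in the question, a tail-based walk may be enough. This is only a sketch, not the answerer's method. It assumes the text of interest sits directly inside <Turn>, so it is stored in the .tail of the last <question> and of each sibling element that follows it. The helper name text_after_last_question and the shortened SAMPLE document are made up for illustration. The import falls back to the standard library's ElementTree, whose API lxml.etree mirrors for the calls used here.

```python
try:
    from lxml import etree  # preferred, as in the question
except ImportError:
    # Assumption: the stdlib ElementTree API is close enough for this sketch.
    import xml.etree.ElementTree as etree

# Shortened, hypothetical sample modelled on the XML in the question.
SAMPLE = """<root>
<Turn><question> this is a question? </question> this is not, I want to get this text. </Turn>
<Turn> There could be a turn with no question here. </Turn>
<Turn><question> another question? </question> followed by <Sync/> other text but also <Event/> other tags <Who/> and all this text. </Turn>
<Turn><question> only a question? </question>
</Turn>
</root>"""


def text_after_last_question(turn):
    """Return the non-blank text chunks that follow the last <question> child."""
    children = list(turn)
    last_index = None
    for i, child in enumerate(children):
        if child.tag == "question":
            last_index = i
    if last_index is None:
        return []  # this Turn has no question at all
    chunks = []
    # Text between sibling tags is stored as the .tail of the preceding element.
    for child in children[last_index:]:
        tail = (child.tail or "").strip()
        if tail:
            chunks.append(tail)
    return chunks


root = etree.fromstring(SAMPLE)
results = [text_after_last_question(turn) for turn in root.iter("Turn")]
for chunks in results:
    # An empty list means only whitespace (or nothing) follows the question.
    print(bool(chunks), chunks)
```

An empty list means that nothing but whitespace follows the last question, which covers the newline and space cases from the question. With lxml specifically, evaluating turn.xpath("question[last()]/following-sibling::text()") on each Turn should select the same text nodes, unstripped.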