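I am parsing some large XML files in order to extract all of the raw text contained under the body element. I don't know which sub-elements the text appears in, so I simply concatenate all of it.
The relevant bit of code I use for this is: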
# find all instances of `body`
result = []
for body in tree.findall('.//{*}body'):
    # iterate recursively over all sub-elements of `body`
    for node in body.iter('*'):
        # append any text if it exists
        if node.text:
            # handle the text
            result.append(node.text.strip())
print(' '.join(result))
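It is pretty simple, and I thought it was working, but I have found some cases where it fails and I am not sure how to fix them. Here is a minimal example extracted from one of the XML files: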
<cja:body>
<ce:sections xmlns:ce=".../xml/common/schema">
<ce:para view="all">For scyptolin A or B the IC
<ce:inf loc="post">50</ce:inf> was erroneously calculated at 3.1 μg/ml. The correct IC
<ce:inf loc="post">50</ce:inf> was determined at 0.16 μg/ml for both scyptolins.
</ce:para>
</ce:sections>
</cja:body>
If I run the code block above on this XML, the output is:
For scyptolin A or B the IC 50 50
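The problem is that, for the para node, node.text seems to retrieve only the text that occurs before the nested inf elements. How can I extract all of the text, not just the text that occurs before the nested elements? To be clear, the desired output here is: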
For scyptolin A or B the IC 50 was erroneously calculated at 3.1 μg/ml. The correct IC 50 was determined at 0.16 μg/ml for both scyptolins.
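For reference, here is a minimal sketch of one way this might be approached, assuming the standard-library xml.etree.ElementTree on Python 3.8 or later (needed for the {*} wildcard). In ElementTree, text that follows a child's closing tag is stored on that child's .tail rather than on the parent's .text, and Element.itertext() walks both in document order. The doc wrapper element, the cja namespace URI, and the body_text helper are placeholders made up for illustration; the ce namespace URI is left elided exactly as in the sample.

```python
import xml.etree.ElementTree as ET

# The sample from the question, wrapped in a hypothetical <doc> root so that
# './/' can find <cja:body> as a descendant. The cja namespace URI is a
# made-up placeholder so the snippet parses on its own.
SAMPLE = """<doc xmlns:cja="urn:example:cja">
<cja:body>
<ce:sections xmlns:ce=".../xml/common/schema">
<ce:para view="all">For scyptolin A or B the IC
<ce:inf loc="post">50</ce:inf> was erroneously calculated at 3.1 μg/ml. The correct IC
<ce:inf loc="post">50</ce:inf> was determined at 0.16 μg/ml for both scyptolins.
</ce:para>
</ce:sections>
</cja:body>
</doc>"""


def body_text(root):
    """Collect the text under every `body` element, including tail text."""
    parts = []
    for body in root.findall('.//{*}body'):
        # itertext() yields each descendant's .text and .tail in document order
        raw = ''.join(body.itertext())
        # collapse newlines and runs of whitespace into single spaces
        parts.append(' '.join(raw.split()))
    return ' '.join(parts)


root = ET.fromstring(SAMPLE)
text = body_text(root)
print(text)
```

Joining the pieces with an empty string and normalising whitespace afterwards avoids inserting spaces in places where the source has none (for example, a subscript written directly against the preceding word). If body elements can be nested inside one another, this sketch would count the inner text twice.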