Question

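I am parsing some large XML files in order to extract all of the raw text contained under the body element. I don't know which child elements the text appears in, so I simply concatenate all of it.

The relevant bit of code I use for this is: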

# find all instances of `body`
result = []
for body in tree.findall('.//{*}body'): 
    # iterate recursively over all sub-elements of `body`
    for node in body.iter('*'):
        # append any text if it exists
        if node.text:
            # handle the text
            result.append(node.text.strip())
print(' '.join(result))

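Pretty simple, and I thought it was working, but I've found some cases where it fails and I'm not sure how to fix them. Here is a minimal example extracted from one of the XML files: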

<cja:body>
  <ce:sections xmlns:ce=".../xml/common/schema">
    <ce:para view="all">For scyptolin A or B the IC
                <ce:inf loc="post">50</ce:inf> was erroneously calculated at 3.1 μg/ml. The correct IC
                <ce:inf loc="post">50</ce:inf> was determined at 0.16 μg/ml for both scyptolins.
            </ce:para>
  </ce:sections>
</cja:body>

If I run the code block above on this XML, the output is:

For scyptolin A or B the IC 50 50

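The problem is that, for the para node, node.text seems to retrieve only the text that occurs before the first nested inf element. How can I extract all of the text, not just the text that precedes the nested elements? To be clear, the desired output here is: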

For scyptolin A or B the IC 50 was erroneously calculated at 3.1 μg/ml. The correct IC 50 was determined at 0.16 μg/ml for both scyptolins.

Is lxml unable to extract all of the text when nested elements are present?
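For reference, the behaviour described above is consistent with how lxml (and the standard library's ElementTree) store mixed content: text that follows a child's closing tag is kept on that child's .tail, not on the parent's .text. The sketch below is one possible way to collect everything, using itertext(), which yields both .text and .tail fragments in document order. It is only an illustration under assumptions: the namespace URIs are placeholders (the real ones are elided in the sample), the all_text helper is a hypothetical name, and the whitespace collapsing is a guess at what the desired output needs.

```python
try:
    from lxml import etree
except ImportError:
    # Fallback: the standard library exposes the same calls used below.
    import xml.etree.ElementTree as etree

# Placeholder namespace URIs (assumption); the originals are elided in the question.
XML = (
    '<cja:body xmlns:cja="urn:example:cja" xmlns:ce="urn:example:ce">'
    '<ce:sections>'
    '<ce:para view="all">For scyptolin A or B the IC\n'
    '    <ce:inf loc="post">50</ce:inf> was erroneously calculated at 3.1 μg/ml. The correct IC\n'
    '    <ce:inf loc="post">50</ce:inf> was determined at 0.16 μg/ml for both scyptolins.\n'
    '</ce:para>'
    '</ce:sections>'
    '</cja:body>'
)


def all_text(element):
    """Hypothetical helper: concatenate every text and tail fragment under
    `element` in document order, then collapse runs of whitespace."""
    return ' '.join(''.join(element.itertext()).split())


body = etree.fromstring(XML)
para = body[0][0]      # the ce:para element
first_inf = para[0]    # the first ce:inf element

print(repr(para.text))       # only the text before the first child
print(repr(first_inf.tail))  # the text after </ce:inf> lives on the child's tail
print(all_text(body))
```

In the original loop, the same helper could replace the inner `for node in body.iter('*')` block, for example `result.append(all_text(body))`.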

0 answers: