Question

Consider this Python script:

from lxml import etree

html = '''
<html xmlns="http://www.w3.org/1999/xhtml">
<head></head>
  <body>
    <p>This is some text followed with 2 citations.<span class="footnote">1</span>
       <span class="footnote">2</span>This is some more text.</p>
  </body>
</html>'''

tree = etree.fromstring(html)

for element in tree.findall(".//{*}span"):
    if element.get("class") == 'footnote':
        print(etree.tostring(element, encoding="unicode", pretty_print=True))

The desired output is just the 2 span elements, but instead I get:

<span xmlns="http://www.w3.org/1999/xhtml" class="footnote">1</span>
<span xmlns="http://www.w3.org/1999/xhtml" class="footnote">2</span>This is some more text.

Why does it include the text after the element, up to the end of the parent element?

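I'm trying to use lxml to link the footnotes, but when I a.insert() the span element into the a element I created for it, the span brings the text after it along, so the link covers far more text than I want.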

Answer 1

Passing with_tail=False to etree.tostring() leaves the tail text out of the serialized output.

print(etree.tostring(element, encoding="unicode", pretty_print=True, with_tail=False))

See the lxml.etree.tostring documentation.

Answer 2

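It includes the text after the element because, in lxml's data model, that text belongs to the element: it is stored as the element's tail.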
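A minimal sketch of that model, using a simplified, namespace-free version of the markup from the question (the shortened strings are an assumption made for illustration). Text that follows an element's closing tag ends up on that element's tail attribute rather than on the parent:

```python
from lxml import etree

# Simplified, namespace-free version of the markup from the question.
p = etree.fromstring(
    '<p>Some text.<span class="footnote">1</span>'
    '<span class="footnote">2</span>More text.</p>'
)
first, second = p.findall("span")

print(repr(p.text))       # 'Some text.'  (text before the first child)
print(repr(first.text))   # '1'
print(repr(first.tail))   # None          (nothing between the two spans)
print(repr(second.text))  # '2'
print(repr(second.tail))  # 'More text.'  (text after the closing </span>)
```

This is why serializing or moving the second span also carries "More text." with it.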

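If you don't want that text to belong to the preceding span, it has to be wrapped in an element of its own. You can, however, keep it out of the output when serializing back to XML by passing with_tail=False to etree.tostring().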

If you want to remove the tail from a particular element, you can also set that element's tail to ''.
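For the footnote-linking case in the question, simply clearing the tail would lose the trailing text. One possible approach, sketched here under assumptions (the simplified markup and the href value "#fn1" are made up for illustration), is to hand the tail over to the new a element before nesting the span inside it:

```python
from lxml import etree

# Simplified, namespace-free markup; not the asker's full document.
p = etree.fromstring(
    '<p>Some text.<span class="footnote">1</span>More text.</p>'
)
span = p.find("span")

# Create the wrapper link; the href value is a hypothetical placeholder.
a = etree.Element("a", href="#fn1")

# Move the tail to the wrapper *before* moving the span, so the trailing
# text stays in the paragraph instead of ending up inside the link.
a.tail = span.tail
span.tail = None

p.replace(span, a)  # put <a> where the span was
a.append(span)      # then nest the span inside it

result = etree.tostring(p, encoding="unicode")
print(result)
# <p>Some text.<a href="#fn1"><span class="footnote">1</span></a>More text.</p>
```

The order matters: lxml moves an element's tail together with the element, so the tail has to be reassigned before the span is relocated.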

Why does this element in lxml include the tail?
