Question

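Given an lxml element xml, I iterate over all of its children c[0..n] by calling c.getnext(). I do this because I need to insert children dynamically where necessary, and I cannot use an iterator for that. All elements have text and tail set.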
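A minimal sketch of that traversal, assuming a made-up sample tree and an illustrative `<hr/>` marker (neither appears in my real data): children are walked with getnext(), and a new sibling is inserted after every `<b>` while walking.

```python
from lxml import etree

# Illustrative markup only; not the document from the question.
root = etree.fromstring("<p>a<b>x</b>b<b>y</b>c</p>")

child = root[0] if len(root) else None
while child is not None:
    if child.tag == "b":
        marker = etree.Element("hr")
        # insert() leaves child.tail attached to child.
        root.insert(root.index(child) + 1, marker)
        # Continue from the inserted element so it is not visited again.
        child = marker
    child = child.getnext()

result = etree.tostring(root)
print(result)  # b'<p>a<b>x</b>b<hr/><b>y</b>c<hr/></p>'
```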

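Let me illustrate the different behavior of addnext() and insert() with the following example. Take a simple XML string, which I parse into an lxml tree and then inspect as a sanity check: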

>>> import lxml.etree
>>> s = "<p>This is <b>bold</b> and this is italic text.</p>"
# Create a new lxml element.
>>> xml = lxml.etree.fromstring(s)
# Let's look at the element, its child, and all the texts and tails.
>>> lxml.etree.tostring(xml)
b'<p>This is <b>bold</b> and this is italic text.</p>'
>>> xml.text
'This is '
>>> xml.tail
>>> xml[0].text
'bold'
>>> xml[0].tail
' and this is italic text.'

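So far so good, and exactly what I expect (see here for more on the lxml representation).

Now I want to wrap the word "italic" in <i> tags, just as "bold" is wrapped in <b> tags. To do that, I first find the index at which the "italic" substring begins: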

# Find the index of the "italic" substring.
>>> idx = xml[0].tail.find("italic")
>>> idx
13

Then I create a new lxml element:

# Create a new element and inspect it.
>>> new_c = lxml.etree.fromstring("<i>italic</i>")
>>> new_c.text
'italic'
>>> new_c.tail
>>>

To insert this new element correctly into the xml tree, I have to split the original xml[0].tail string into two substrings and remove "italic" from it:

>>> new_c.tail = xml[0].tail[idx+len("italic"):]
>>> xml[0].tail = xml[0].tail[:idx]

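Now everything is set up to insert the new element into the xml element, and this is where I am stuck. Adding the new child new_c after the given xml[0] gives different results depending on the call, and the Element API documentation gives me no further information: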

# Adds the element as a following sibling directly after this element.
# Note that tail text is automatically discarded when adding at the root level.
>>> xml[0].addnext(new_c)
>>> lxml.etree.tostring(xml)
b'<p>This is <b>bold</b><i>italic</i> text. and this is </p>'

and

# Inserts a subelement at the given position in this element
>>> xml.insert(1 + xml.index(xml[0]), new_c)
>>> lxml.etree.tostring(xml)
b'<p>This is <b>bold</b> and this is <i>italic</i> text.</p>'

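The two calls seem to handle the tail differently (see the comment on addnext() regarding the tail). Even taking that comment into account, the tail text is not discarded but appended, and the root level is not treated differently from deeper levels: exactly the same behavior can be observed after wrapping the original XML in s in an additional <foo> element.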
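For reference, a self-contained sketch that reproduces both results side by side. `build` is a hypothetical helper introduced only to rebuild the tree and the prepared `<i>` element for each call; everything else follows the session above.

```python
from lxml import etree

def build():
    """Rebuild the tree and the prepared <i> element (hypothetical helper)."""
    xml = etree.fromstring("<p>This is <b>bold</b> and this is italic text.</p>")
    b = xml[0]
    idx = b.tail.find("italic")
    new_c = etree.fromstring("<i>italic</i>")
    # Split the old tail around the word "italic".
    new_c.tail = b.tail[idx + len("italic"):]
    b.tail = b.tail[:idx]
    return xml, new_c

# Variant 1: addnext() places <i> and its tail directly after </b>.
xml_a, new_a = build()
xml_a[0].addnext(new_a)
via_addnext = etree.tostring(xml_a)

# Variant 2: insert() places <i> after <b> including b's tail.
xml_b, new_b = build()
xml_b.insert(1 + xml_b.index(xml_b[0]), new_b)
via_insert = etree.tostring(xml_b)

print(via_addnext)  # b'<p>This is <b>bold</b><i>italic</i> text. and this is </p>'
print(via_insert)   # b'<p>This is <b>bold</b> and this is <i>italic</i> text.</p>'
```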

What am I missing here?

EDIT: A related discussion on the lxml mailing list is here.

Answer 1

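elem.addnext(nextelem) operates at the XML level, i.e. it adds the content directly after the element and moves any tail text behind the newly inserted element. This is done so that the new element directly follows its sibling.

parent.insert(where, elem) works exactly as if the parent element were just a list of etree.Element objects. It puts a new element into the list without changing the etree.Element instances. parent.append(elem) works this way too, as does any other list operation.

So these functions take two different views of the element tree.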

>>> from lxml import etree as et
>>> 
>>> x = et.XML('<a>foo<b/>bar</a>')
>>> y = et.XML('<c>C!</c>')
>>> 
>>> et.dump(x)
<a>foo<b/>bar</a>
>>> x.find('b').addnext(y)
>>> et.dump(x)
<a>foo<b/><c>C!</c>bar</a>

The tail moves from the b element to the c element, so that the XML document stays unchanged except for the inserted element.
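For contrast, a short sketch of the list-style view on the same tree: with insert(), the text 'bar' stays attached to b as its tail, and the new element lands after it.

```python
from lxml import etree as et

x = et.XML('<a>foo<b/>bar</a>')
y = et.XML('<c>C!</c>')

# List-style insertion: b keeps its tail, c is placed after "b plus tail".
x.insert(1, y)

out = et.tostring(x)
print(out)                # b'<a>foo<b/>bar<c>C!</c></a>'
print(x.find('b').tail)   # bar
print(y.tail)             # None
```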

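Now, if the inserted element already has a tail, addnext inserts the element together with its trailing text directly after the XML element, not after the etree Element-with-tail.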

>>> x = et.XML('<a>foo<b/>bar</a>')
>>> y = et.XML('<c>C!</c>')
>>> y.tail = 'more...'
>>> 
>>> x.find('b').addnext(y)
>>> et.dump(x)
<a>foo<b/><c>C!</c>more...bar</a>
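If the list-style result is what you want, the tail bookkeeping from the question can be folded into a small helper built on insert(). This is only a sketch: `wrap_tail_word` is a hypothetical name, it handles just the first occurrence of the word, and it assumes the element has a parent.

```python
from lxml import etree

def wrap_tail_word(elem, word, tag):
    """Wrap the first occurrence of `word` in elem.tail in a new <tag> element.

    Hypothetical helper; assumes elem has a parent. Returns the new
    element, or None if `word` does not occur in the tail.
    """
    tail = elem.tail or ""
    idx = tail.find(word)
    if idx < 0:
        return None
    new = etree.Element(tag)
    new.text = word
    # Text after the word becomes the new element's tail,
    # text before it stays with the original element.
    new.tail = tail[idx + len(word):] or None
    elem.tail = tail[:idx] or None
    parent = elem.getparent()
    # insert() does not touch the tails set above.
    parent.insert(parent.index(elem) + 1, new)
    return new

root = etree.fromstring("<p>This is <b>bold</b> and this is italic text.</p>")
wrap_tail_word(root[0], "italic", "i")
wrapped = etree.tostring(root)
print(wrapped)  # b'<p>This is <b>bold</b> and this is <i>italic</i> text.</p>'
```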

Answer 2

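tail exists only at the lxml level; in libxml2 it is a text node, just as in the DOM. The main reason is convenience when parsing formatted XML (http://lxml.de/tutorial.html#elements-contain-text):

"The two properties .text and .tail are enough to represent any text content in an XML document. This way, the ElementTree API does not require any special text nodes in addition to the Element class, that tend to get in the way fairly often (as you might know from classic DOM APIs)."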

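As far as I can see from the source code, all lxml functions strive to maintain this abstraction. For example, index() counts only element, comment, entity reference, and PI nodes, and the tree manipulation routines always seem to move a node's tail along with it. However, since this concept:

is insufficiently documented
is tailored to XML where the user does not care about trailing text
conflicts with the general statement

there appear to be inconsistencies in how it is applied. This looks like one of them (and a bug, if consistency is the goal). I will raise the last statement with the maintainers to clarify the library's intended behavior regarding tails.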
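A small sketch of the two observations above, using made-up markup: index() skips text nodes but counts the comment, and remove() takes the element's tail along with it.

```python
from lxml import etree

# Illustrative markup only.
root = etree.fromstring("<a>foo<!-- note --><b/>bar<c/>baz</a>")
b = root.find("b")

# The comment is child 0, so b is child 1; the text "foo" is not counted.
pos = root.index(b)

# Removing b carries its tail "bar" away with it.
root.remove(b)
remaining = etree.tostring(root)

print(pos)        # 1
print(b.tail)     # bar
print(remaining)  # b'<a>foo<!-- note --><c/>baz</a>'
```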

lxml: difference between Element addnext() and insert() when handling the tail
