Get inner text from lxml

Time: 2015-06-11 06:11:37

Tags: python lxml

lxml.html.fromstring insists on wrapping everything in a tag (p by default). From this tag tree,

<p>this is <b>the</b> good stuff</p>

I want to extract the string:

this is <b>the</b> good stuff

How can I do that?

1 answer:

Answer 0 (score: 8)

This is usually called "inner XML" rather than "inner text". Here is one possible way to get an element's inner XML:

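import lxml.etree as etree
import lxml.html

html = "<p>this is <b>the</b> good stuff</p>"
tree = lxml.html.fromstring(html)
node = tree.xpath("//p")[0]

# node.text is the text before the first child element (None if there is none).
# tostring() includes each child's tail text by default; encoding="unicode"
# makes it return str instead of bytes, so the join also works on Python 3.
result = (node.text or "") + "".join(
    etree.tostring(e, encoding="unicode") for e in node
)
print(result)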

Output

this is <b>the</b> good stuff