Question

I need an XML parser that returns the exact text inside a node, just as it was provided. So for the node:

<title>This is title of special book</title>

If

i = element.text

then

i should be "This is title of special book"

But if

<title>This is title of <i&ht;special book</title>

then

i should be "This is title of <i&ht;special book"

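The reason is that I later render these variables in an HTML template, and I need to render them exactly as provided, whether they contain formatting tags or escaped tags.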

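I haven't found a way to do this with lxml. In fact, if unescaped HTML tags are provided, it does not read the text correctly (in the example above, it would give i = "This is title of "), and if escaped HTML tags are provided, it unescapes them.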

What is the right alternative I should use? Or is there perhaps a way to do this with lxml?
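The behaviour described above can be reproduced with a short sketch. The two input strings are illustrative assumptions (one with a nested, unescaped `<i>` element and one with the same tag written as entities), and the fallback import to the standard library's `xml.etree.ElementTree` is only there so the snippet runs without lxml installed; `.text` behaves the same way in both for these inputs.

```python
try:
    from lxml import etree as ET
except ImportError:  # assumption: stdlib fallback, .text behaves the same here
    import xml.etree.ElementTree as ET

# Illustrative inputs: an unescaped child element and an entity-escaped one.
nested = ET.fromstring('<title>This is title of <i>special</i> book</title>')
escaped = ET.fromstring('<title>This is title of &lt;i&gt;special&lt;/i&gt; book</title>')

# .text only holds the characters before the first child element.
nested_text = nested.text
# Entities are decoded while parsing, so the original escaping is lost.
escaped_text = escaped.text

print(repr(nested_text))   # 'This is title of '
print(repr(escaped_text))  # 'This is title of <i>special</i> book'
```

Neither value matches the markup as it was written in the source, which is the problem being asked about.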

Answer 1

I would use a custom function that determines whether the node text is escaped or unescaped, and optionally escapes it:

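from lxml import etree as ET

def get_text(node):
  """Return the node's content as it was written in the source XML."""
  tag = node.tag
  text = node.text or ''
  # Serialize this node only (not `root`), without any trailing tail text.
  raw = ET.tostring(node, with_tail=False).decode()
  # Equal lengths mean no child elements and no escaped characters.
  if len(f'<{tag}>{text}</{tag}>') == len(raw):
    return text
  else:
    # Otherwise strip the outer tags and keep the inner markup untouched.
    return raw.replace(f'<{tag}>', '').replace(f'</{tag}>', '')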
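As a rough usage sketch, the function can be exercised on three sample nodes. The function is restated here so the snippet runs on its own, serializing `node` itself; the sample inputs and the standard-library fallback import are assumptions made for illustration, and `encoding='unicode'` is used so that `tostring` returns a `str` in either library.

```python
try:
    from lxml import etree as ET
except ImportError:  # assumption: stdlib fallback so the sketch runs without lxml
    import xml.etree.ElementTree as ET

def get_text(node):
    tag = node.tag
    text = node.text or ''
    # Serialize the node itself; root nodes parsed below carry no tail text.
    raw = ET.tostring(node, encoding='unicode')
    # Equal lengths: plain text only, so .text is already exact.
    if len(f'<{tag}>{text}</{tag}>') == len(raw):
        return text
    # Otherwise keep the inner markup exactly as serialized.
    return raw.replace(f'<{tag}>', '').replace(f'</{tag}>', '')

# Illustrative inputs: plain text, a nested element, and escaped entities.
plain = ET.fromstring('<title>This is title of special book</title>')
nested = ET.fromstring('<title>This is title of <i>special</i> book</title>')
escaped = ET.fromstring('<title>This is title of &lt;i&gt;special&lt;/i&gt; book</title>')

print(get_text(plain))    # This is title of special book
print(get_text(nested))   # This is title of <i>special</i> book
print(get_text(escaped))  # This is title of &lt;i&gt;special&lt;/i&gt; book
```

Note that the `replace` calls remove every occurrence of the outer tag, so a node that contains a child element with the same tag name would need a more careful approach.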

XML parser that allows both escaped and unescaped characters
