Question

lxml.etree.XMLSyntaxError: Document labelled UTF-16 but has UTF-8 content

I get this error when using the lxml library in Python. The other solutions/hacks I found (in PHP) replace utf-16 with utf-8 in the file. What is the Pythonic way to solve this?

Python code:

import lxml.etree as etree

tree = etree.parse("req.xml")

req.xml:

<?xml version="1.0" encoding="utf-16"?>
<test 
    xmlns:xsd="http://www.w3.org/2001/XMLSchema" 
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"> 
</test>

Answer 1

Have a look at the documentation of the XMLParser constructor:

>>> help(etree.XMLParser)

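Among the other options there is an encoding parameter, which lets you "override the document encoding", as the docs put it.

That is exactly what you need: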

parser = etree.XMLParser(encoding='UTF-8')
tree = etree.parse("req.xml", parser=parser)

If the error message is accurate (i.e. there is nothing else wrong with the document), I expect this to work.
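To try the override without touching a file on disk, here is a minimal sketch that parses the same document from an in-memory bytes object. The constant XML_BYTES and the helper parse_as_utf8 are names made up for this illustration. Whether the default parser rejects the input may depend on the libxml2 version lxml was built against, so that attempt is only reported, not relied upon.

```python
import lxml.etree as etree

# The document from the question: declared as utf-16, but the bytes are plain UTF-8/ASCII.
XML_BYTES = (
    b'<?xml version="1.0" encoding="utf-16"?>\n'
    b'<test\n'
    b'    xmlns:xsd="http://www.w3.org/2001/XMLSchema"\n'
    b'    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">\n'
    b'</test>\n'
)


def parse_as_utf8(data):
    """Hypothetical helper: parse bytes while overriding the declared encoding."""
    parser = etree.XMLParser(encoding="UTF-8")
    return etree.fromstring(data, parser=parser)


# Default parser: report what happens instead of assuming a particular outcome.
try:
    etree.fromstring(XML_BYTES)
    print("default parser accepted the document")
except etree.XMLSyntaxError as exc:
    print("default parser failed:", exc)

# Parser with the encoding override from the answer above.
root = parse_as_utf8(XML_BYTES)
print(root.tag, sorted(root.nsmap))
```

The same parser object can be passed to etree.parse("req.xml", parser=parser) when reading from a file, as in the answer.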

Answer 2

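You can parse the XML content with BeautifulSoup, which is the Pythonic way you are looking for.

Note: if your data is declared as utf-16, you can still parse it easily by decoding the file content as utf-8 while reading it.

Here is the code.

sample.xml contains the following data: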

<?xml version="1.0" encoding="utf-16"?>
<test 
    xmlns:xsd="http://www.w3.org/2001/XMLSchema" 
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"> 
</test>

Code:

from bs4 import BeautifulSoup

# Read the file as UTF-8 text, ignoring any bytes that cannot be decoded
with open("sample.xml", "r", encoding="utf-8", errors="ignore") as f:
    content = f.read()

# Parse the decoded content with Python's built-in HTML parser
soup = BeautifulSoup(content, "html.parser")
# Collect the attribute dict of every <test> element
data = [tag.attrs for tag in soup.find_all("test")]
print(data)

Output (Python 3):

[{'xmlns:xsd': 'http://www.w3.org/2001/XMLSchema', 'xmlns:xsi': 'http://www.w3.org/2001/XMLSchema-instance'}]
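If an extra dependency is not wanted, the workaround mentioned in the question (rewriting the declared encoding) can also be sketched with the standard library alone. This is only an illustration under the assumption that the bytes really are UTF-8. The helper relabel_as_utf8 and the sample bytes are made up for this example, and xml.etree.ElementTree is swapped in for lxml here.

```python
import re
import xml.etree.ElementTree as ET

# Sample bytes standing in for the file content: labelled utf-16, actually UTF-8.
RAW = (
    b'<?xml version="1.0" encoding="utf-16"?>\n'
    b'<test\n'
    b'    xmlns:xsd="http://www.w3.org/2001/XMLSchema"\n'
    b'    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">\n'
    b'</test>\n'
)


def relabel_as_utf8(data):
    """Hypothetical helper: rewrite the encoding named in a leading XML declaration."""
    return re.sub(
        rb'^(<\?xml[^>]*?encoding=)(["\'])[^"\']*\2',
        rb'\g<1>\g<2>utf-8\g<2>',
        data,
        count=1,
    )


fixed = relabel_as_utf8(RAW)
root = ET.fromstring(fixed)  # parses now that the label matches the content
print(fixed.splitlines()[0].decode("ascii"))
print(root.tag)
```

Unlike the html.parser route, this keeps a real XML parser in the loop, so a document that is not well-formed is still reported as an error rather than silently repaired.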
