xml.etree.ElementTree.ParseError: not well-formed (invalid token)

Date: 2018-06-26 19:13:29

Tags: python python-3.x xml-parsing

Using Python 3.

The error we get:

File "C:/scratch.py", line 27, in run
    tree = ET.fromstring(responses[0].decode(), ET.XMLParser(encoding='utf-8'))
  File "C:\Programs\Python\Python36-32\lib\xml\etree\ElementTree.py", line 1314, in XML
    parser.feed(text)
xml.etree.ElementTree.ParseError: not well-formed (invalid token): line 163, column 1106

Our code:

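import xml.etree.ElementTree as ET

# responses is a list of raw response bodies (bytes) from the URL GET requests
tree = ET.fromstring(responses[0].decode(), ET.XMLParser(encoding='utf-8'))
for i in tree.iter('item'):
    try:
        title = i.find('title').text
    except Exception:
        pass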

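responses[0] comes from a list of responses returned by URL GET requests. Here we are testing index 0 against one specific URL: http://feeds.feedburner.com/marginalrevolution/feed

We were able to paste the XML into the W3 Schools validator and got: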

This page contains the following errors:
error on line 163 at column 31: Input is not in proper UTF-8, indicate encoding! Bytes: 0x0C 0x66 0x69 0x67

But shouldn't passing ET.XMLParser(encoding='utf-8') resolve this error at parse time?

1 Answer:

Answer 0 (score: 1)

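The error message from the W3 Schools validator is misleading. The problem with 0x0c is not that it is invalid UTF-8, but that it is not a legal character in XML. Among the control characters below 0x20, XML 1.0 allows only tab (0x09), line feed (0x0A), and carriage return (0x0D).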
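To make the distinction concrete, here is a minimal sketch using a made-up one-line document (not the actual feed). It suggests that the byte decodes cleanly as UTF-8, yet the parser still rejects it whatever encoding is passed to XMLParser:

```python
import xml.etree.ElementTree as ET

# Hypothetical minimal document with a form feed (0x0C) inside text content.
doc = b"<rss><title>a\x0cb</title></rss>"

# Decoding succeeds: 0x0C is a valid single-byte UTF-8 sequence (U+000C).
decoded = doc.decode("utf-8")

# Parsing still fails, because the character itself is not allowed in XML.
try:
    ET.fromstring(doc, ET.XMLParser(encoding="utf-8"))
    parse_error = None
except ET.ParseError as exc:
    parse_error = exc

print(parse_error)
```

The encoding argument only tells the parser how to turn bytes into characters; it does not relax the set of characters XML permits.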

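0x0c is the form feed control character, so it serves little purpose in the document. A conforming XML parser must reject documents that are not well-formed, and you cannot change the RSS feed, so the simplest solution is to strip the character from the document before processing it.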

>>> tree = ET.fromstring(original_response, ET.XMLParser(encoding='utf-8'))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python3.7/xml/etree/ElementTree.py", line 1315, in XML
    parser.feed(text)
xml.etree.ElementTree.ParseError: not well-formed (invalid token): line 185, column 1106

>>> fixed = original_response.replace(b'\x0c', b'')
>>> tree = ET.fromstring(fixed, ET.XMLParser(encoding='utf-8'))
>>> tree
<Element 'rss' at 0x7ff316db6278>
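
If other feeds might contain different stray control characters, the same idea can be generalized. The following is only a sketch: the helper name strip_illegal_xml_bytes and the sample input are made up for illustration, and it assumes the payload is UTF-8 or another ASCII-compatible encoding (it would corrupt UTF-16 data). It removes every byte below 0x20 except tab, line feed, and carriage return:

```python
import re
import xml.etree.ElementTree as ET

# C0 control bytes that XML 1.0 forbids: everything below 0x20 except
# tab (0x09), line feed (0x0A) and carriage return (0x0D).
_ILLEGAL_XML_BYTES = re.compile(rb"[\x00-\x08\x0b\x0c\x0e-\x1f]")


def strip_illegal_xml_bytes(data: bytes) -> bytes:
    """Remove C0 control bytes that are not allowed in XML 1.0.

    Hypothetical helper; assumes an ASCII-compatible encoding such as UTF-8,
    where bytes below 0x80 never occur inside a multi-byte sequence.
    """
    return _ILLEGAL_XML_BYTES.sub(b"", data)


# Made-up stand-in for the feed body, with a form feed and a 0x01 byte.
raw = b"<rss><item><title>a\x0cb\x01c</title></item></rss>"
tree = ET.fromstring(strip_illegal_xml_bytes(raw), ET.XMLParser(encoding="utf-8"))
title = tree.find("item/title").text
print(title)  # abc
```

Note that this silently drops the offending bytes, so it is best suited to cases like this feed, where the control characters carry no meaning.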