Python 3

Question

When I load the XML file below with xmltodict, I get an error: xml.parsers.expat.ExpatError: not well-formed (invalid token): line 1, column 1

Here is my file:

<?xml version="1.0" encoding="utf-8"?>
<mydocument has="an attribute">
  <and>
    <many>elements</many>
    <many>more elements</many>
  </and>
  <plus a="complex">
    element as well
  </plus>
</mydocument>

Source code:

import xmltodict
with open('fileTEST.xml') as fd:
   xmltodict.parse(fd.read())

I am on Windows 10, using Python 3.6 and xmltodict 0.11.0.

If I use ElementTree, it works:

import xml.etree.ElementTree as ET

tree = ET.ElementTree(file='fileTEST.xml')
for elem in tree.iter():
    print(elem.tag, elem.attrib)

mydocument {'has': 'an attribute'}
and {}
many {}
many {}
plus {'a': 'complex'}

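Note: I may have a newline problem.

Note 2: I compared two different files with Beyond Compare. Parsing crashes on the file encoded as UTF-8 with BOM and works on the plain UTF-8 file. The UTF-8 BOM is a byte sequence (EF BB BF) that lets a reader identify the file as UTF-8 encoded.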
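
The observation in Note 2 can be reproduced with the standard library alone. The sketch below feeds expat (the parser named in the traceback) the same document twice: once as raw bytes with a BOM, and once after the bytes were decoded with a legacy code page. cp1252 is an assumption here, chosen because it is a common default for open() on Windows; the helper name parses is hypothetical.

```python
import codecs
from xml.parsers import expat

XML = '<?xml version="1.0" encoding="utf-8"?><mydocument has="an attribute"/>'
raw = codecs.BOM_UTF8 + XML.encode("utf-8")  # bytes of a "UTF-8 with BOM" file


def parses(data: bytes) -> bool:
    """Return True if expat accepts the data, False if it raises ExpatError."""
    parser = expat.ParserCreate()
    try:
        parser.Parse(data, True)
        return True
    except expat.ExpatError:
        return False


# Raw bytes: expat recognises the BOM on its own.
ok_bytes = parses(raw)

# Decoded as cp1252 (assumed default), the BOM becomes three ordinary
# characters in front of the XML declaration.
mis_decoded = raw.decode("cp1252")
ok_misdecoded = parses(mis_decoded.encode("utf-8"))

print(ok_bytes, ok_misdecoded, len(mis_decoded) - len(XML))
```

If this matches what happens on the affected machine, the BOM itself is harmless and the failure comes from decoding it with the wrong codec before parsing.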

Answer 1

In my case, the file had been saved with a byte order mark, which Notepad++ does by default.

I re-saved the file without the BOM, as plain UTF-8.
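
The same re-save can be scripted instead of done in an editor. This is a minimal sketch, assuming the file is UTF-8 and small enough to read into memory; strip_utf8_bom is a hypothetical helper name, not part of xmltodict.

```python
import codecs
from pathlib import Path


def strip_utf8_bom(path: str) -> bool:
    """Rewrite the file without a leading UTF-8 BOM.

    Returns True if a BOM was found and removed, False if the file
    was left untouched.
    """
    file_path = Path(path)
    raw = file_path.read_bytes()
    if raw.startswith(codecs.BOM_UTF8):
        # Drop exactly the three BOM bytes (EF BB BF) and write the rest back.
        file_path.write_bytes(raw[len(codecs.BOM_UTF8):])
        return True
    return False
```

Calling strip_utf8_bom('fileTEST.xml') once should leave a file that the original open(...) / xmltodict.parse(...) code can read. Note that it overwrites the file in place, so keep a copy if the original matters.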

Answer 2

Python 3

One-liner

data: dict = xmltodict.parse(ElementTree.tostring(ElementTree.parse(path).getroot()))

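Helper for `.json` and `.xml`

I wrote a small helper function that loads a .json or .xml file from a given path. I thought it might be useful to some people here:

import json
from xml.etree import ElementTree

import xmltodict

def load_json(path: str) -> dict: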
    if path.endswith(".json"):
        print(f"> Loading JSON from '{path}'")
        with open(path, mode="r") as open_file:
            content = open_file.read()

        return json.loads(content)
    elif path.endswith(".xml"):
        print(f"> Loading XML as JSON from '{path}'")
        xml = ElementTree.tostring(ElementTree.parse(path).getroot())
        return xmltodict.parse(xml, attr_prefix="@", cdata_key="#text", dict_constructor=dict)

    print(f"> Loading failed for '{path}'")
    return {}

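Notes

If you want to get rid of the @ and #text markers in the JSON output, pass the parameters attr_prefix="" and cdata_key="".
By default xmltodict.parse() returns an OrderedDict, but you can change that with the parameter dict_constructor=dict.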

Usage

path = "my_data.xml"
data = load_json(path)
print(json.dumps(data, indent=2))

# OUTPUT
#
# > Loading XML as JSON from 'my_data.xml' 
# {
#   "mydocument": {
#     "@has": "an attribute",
#     "and": {
#       "many": [
#         "elements",
#         "more elements"
#       ]
#     },
#     "plus": {
#       "@a": "complex",
#       "#text": "element as well"
#     }
#   }
# }


Answer 3

I think you forgot to specify the encoding. I suggest loading the XML file into a string variable first:

import xml.etree.ElementTree as ET
import xmltodict
import json


tree = ET.parse('your_data.xml')
xml_data = tree.getroot()
#here you can change the encoding type to be able to set it to the one you need
xmlstr = ET.tostring(xml_data, encoding='utf-8', method='xml')

data_dict = dict(xmltodict.parse(xmlstr))

Answer 4

xmltodict seems unable to parse <?xml version="1.0" encoding="utf-8"?>

If you remove this line, it works.

Answer 5

In my case, the problem was the first three characters, so removing them worked:

import xmltodict
from xml.parsers.expat import ExpatError

with open('your_data.xml') as f:
    data = f.read()
    try:
        doc = xmltodict.parse(data)
    except ExpatError:
        doc = xmltodict.parse(data[3:])
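
Slicing off three characters depends on how the file happened to be decoded. A sketch of an alternative, assuming the file is UTF-8 with an optional BOM, is to open it with Python's utf-8-sig codec, which drops the BOM while decoding. The helper name read_xml_text and the demo file are hypothetical.

```python
import codecs
import os
import tempfile


def read_xml_text(path: str) -> str:
    """Read a UTF-8 file as text, dropping a leading BOM if there is one."""
    # "utf-8-sig" behaves like "utf-8" but skips the BOM when decoding.
    with open(path, encoding="utf-8-sig") as f:
        return f.read()


# Demo on a throwaway file saved as "UTF-8 with BOM".
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "with_bom.xml")
    with open(demo_path, "wb") as f:
        f.write(codecs.BOM_UTF8 + b'<?xml version="1.0" encoding="utf-8"?><mydocument/>')
    text = read_xml_text(demo_path)

print(text.startswith("<?xml"))  # True
# The resulting string can then be passed to xmltodict.parse(text).
```

Because the codec also accepts files without a BOM, the try/except fallback should no longer be needed for this particular failure.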
