Validating XML against an external DTD using Python and lxml

Date: 2014-03-13 22:18:04

Tags: python xml validation lxml dtd

I am trying to validate an XML file against the external DTD referenced in its doctype declaration. Specifically:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE en-export SYSTEM "http://xml.evernote.com/pub/evernote-export3.dtd">
...the rest of the document...

I am using Python 3.3 and the lxml module. After reading http://lxml.de/validation.html#validation-at-parse-time, I threw this together:

enexFile = open(sys.argv[2], mode="rb") # sys.argv[2] is the path to an XML file in local storage.
enexParser = etree.XMLParser(dtd_validation=True)
enexTree = etree.parse(enexFile, enexParser)

From my reading of validation.html, the lxml library should now take care of retrieving the DTD and performing the validation. Instead, I get this:

$ ./mapwrangler.py validate notes.enex
Traceback (most recent call last):
  File "./mapwrangler.py", line 27, in <module>
    enexTree = etree.parse(enexFile, enexParser)
  File "lxml.etree.pyx", line 3239, in lxml.etree.parse (src/lxml/lxml.etree.c:69955)
  File "parser.pxi", line 1769, in lxml.etree._parseDocument (src/lxml/lxml.etree.c:102257)
  File "parser.pxi", line 1789, in lxml.etree._parseFilelikeDocument (src/lxml/lxml.etree.c:102516)
  File "parser.pxi", line 1684, in lxml.etree._parseDocFromFilelike (src/lxml/lxml.etree.c:101442)
  File "parser.pxi", line 1134, in lxml.etree._BaseParser._parseDocFromFilelike (src/lxml/lxml.etree.c:97069)
  File "parser.pxi", line 582, in lxml.etree._ParserContext._handleParseResultDoc (src/lxml/lxml.etree.c:91275)
  File "parser.pxi", line 683, in lxml.etree._handleParseResult (src/lxml/lxml.etree.c:92461)
  File "parser.pxi", line 622, in lxml.etree._raiseParseError (src/lxml/lxml.etree.c:91757)
lxml.etree.XMLSyntaxError: Validation failed: no DTD found !, line 3, column 43

This surprised me, because if I turn validation off the document parses fine, and I can print(enexTree.docinfo.doctype) to get

$ ./mapwrangler.py validate notes.enex
<!DOCTYPE en-export SYSTEM "http://xml.evernote.com/pub/evernote-export3.dtd">

So it seems to me that there should be no problem finding the DTD.
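
For reference, a minimal sketch of that check, assuming lxml is installed. The sample document and the helper name describe_doctype are made up for illustration. Note that it only reads what the DOCTYPE declares; the default parser does not load the DTD itself, so a successful parse here says nothing about whether the DTD can be retrieved.

```python
import io
from lxml import etree

# Stand-in document (hypothetical content); only the DOCTYPE line matters here.
SAMPLE = (
    b'<?xml version="1.0" encoding="UTF-8"?>\n'
    b'<!DOCTYPE en-export SYSTEM "http://xml.evernote.com/pub/evernote-export3.dtd">\n'
    b'<en-export/>\n'
)


def describe_doctype(source):
    """Parse without validation and report what the DOCTYPE declares."""
    # Default parser: no DTD loading and no validation, so no network access.
    tree = etree.parse(source)
    info = tree.docinfo
    return {
        "doctype": info.doctype,
        "system_url": info.system_url,
        "public_id": info.public_id,
    }


if __name__ == "__main__":
    for key, value in describe_doctype(io.BytesIO(SAMPLE)).items():
        print(key, "=", value)
```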

Thanks for your help.

2 answers:

Answer 0: (score: 2)

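You need to pass no_network=False when constructing the parser object. This option defaults to True, so the parser refuses to fetch the DTD from its http:// URL.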

From the parser options documentation at http://lxml.de/parsing.html#parsers:

  

no_network - prevent network access when looking up external documents (on by default)
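
As a rough sketch of how the two options combine (assuming lxml is installed; the helper names make_validating_parser and validate_file and the allow_network flag are hypothetical, not part of lxml):

```python
from lxml import etree


def make_validating_parser(allow_network=False):
    """Build a parser that validates against the DTD named in the DOCTYPE."""
    # no_network defaults to True in lxml; passing False lets the parser
    # fetch DTDs that are referenced by a remote URL.
    return etree.XMLParser(dtd_validation=True, no_network=not allow_network)


def validate_file(path, allow_network=False):
    """Return (is_valid, message) for the XML file at path."""
    parser = make_validating_parser(allow_network)
    try:
        etree.parse(path, parser)
    except etree.XMLSyntaxError as error:
        # With dtd_validation=True, validation failures surface as parse errors.
        return False, str(error)
    return True, ""
```

For the document in the question, the call would look like validate_file("notes.enex", allow_network=True). That additionally assumes the underlying libxml2 build can fetch over HTTP and that the DTD URL is reachable. With the default allow_network=False, only DTDs that resolve to local files can be loaded.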

Answer 1: (score: 0)

For reasons I still do not understand, my problem was related to the location of the XML catalog on my local file system.

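In my case, I use an XML editor that is tightly integrated with a component content management system (CCMS, in this case SDL Trisoft 2011 R2). When the editor connects to the CCMS, the DTDs, catalog files, and a bunch of other files are synchronized. These files end up on the local file system at: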

C:\Users\[username]\AppData\Local\Trisoft\InfoShare Client\[id]\Config\DocTypes\catalog.xml

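I could not get validation to work with the catalog in that location. Simply copying the whole directory to another, fixed location made it work: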

import os
from lxml import etree

f = r"path/to/my/file.xml"
# set the XML catalog file path
os.environ['XML_CATALOG_FILES'] = r'C:\DATA\Mydoctypes\catalog.xml'
# configure the parser
parser = etree.XMLParser(dtd_validation=True, no_network=True)
# validate
try:
    valid = etree.parse(f, parser=parser)
    print("This file is valid against the DTD.")
except etree.XMLSyntaxError as error:
    print("This file is INVALID against the DTD!")
    print(error)

Obviously this is not ideal, but it does work.

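Could it be related to file permissions, or perhaps the old "file path too long" problem in Windows? I have not yet tried whether a symbolic link would help.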

I am using Windows 7, Python 2.7.11, and lxml version 3.6.0.
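
A different route that sidesteps catalog lookup entirely is to load a local copy of the DTD with etree.DTD and validate an already-parsed tree. This is a different technique from the catalog-based snippet above, shown only as a minimal sketch. It assumes lxml is installed; the helper name validate_against_local_dtd is hypothetical.

```python
from lxml import etree


def validate_against_local_dtd(xml_source, dtd_source):
    """Validate xml_source against a DTD loaded from dtd_source.

    Both arguments may be file paths or open file objects.
    Returns (is_valid, messages).
    """
    dtd = etree.DTD(dtd_source)
    # Plain parse: the DOCTYPE in the document is not followed here.
    tree = etree.parse(xml_source)
    is_valid = dtd.validate(tree)
    # Collect the validation errors recorded by the last validate() call.
    messages = [entry.message for entry in dtd.error_log.filter_from_errors()]
    return is_valid, messages
```

It would be called with the XML file and a local DTD file, for example validate_against_local_dtd(r"path/to/my/file.xml", dtd_path), where dtd_path is a placeholder for wherever the DTD copy lives. DTDs that pull in further modules through public identifiers may still depend on a catalog, so this may not be enough for every setup.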