Question

I'm writing this script to download an HTML document from http://example.com/ and am trying to parse it as XML with the following:

with urllib.request.urlopen("http://example.com/") as f:
    tree = xml.etree.ElementTree.parse(f)

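However, I keep getting a ParseError: mismatched tag error, supposedly at line 1, column 2781. I downloaded the file manually (Ctrl + S in my browser) and checked it, but that position falls in the middle of a string, nowhere near EOF. The saved copy does have a few lines before the actual 2781st character, though, so that may have thrown off my calculation of the exact position. I then tried downloading the response and writing it to a file so I could parse it afterwards: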

response = urllib.request.urlopen("http://example.com/")
f = open("test.html", "wb")
f.write(response.read())
f.close()
html = open("test.html", "r")
tree = xml.etree.ElementTree.parse(html)

I still get the same mismatched tag error at the same column, but this time I opened the downloaded HTML, and the only thing near column 2781 is:

;</script></head><body class

Column 2781 falls exactly on the first "h" of </head>, so what could be going wrong here? Am I missing something?

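Edit

I've been looking into this and tried parsing the XML with another parser, minidom this time, but I still get exactly the same error on exactly the same line. What could the problem be? The result is the same no matter how I download the file (urllib, curl, wget, or even Ctrl + S in the browser).

Edit 2:

Here is what I have tried so far:

This is a sample XML document I took from the API documentation and saved to text.html: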

<html>
    <head>
        <title>Example page</title>
    </head>
    <body>
        <p>Moved to <a href="http://example.org/">example.org</a>
        or <a href="http://example.com/">example.com</a>.</p>
    </body>
</html>

I tried:

with urllib.request.urlopen("text.html") as f:
    tree = xml.etree.ElementTree.parse(f)

and it worked. Then:

with urllib.request.urlopen("text.html") as f:
    tree = xml.etree.ElementTree.fromstring(f.read())

That also worked, but:

with urllib.request.urlopen("http://example.com/") as f:
    xml.etree.ElementTree.parse(f)

does not. I also tried:

with urllib.request.urlopen("http://example.com/") as f:
    xml.etree.ElementTree.fromstring(f.read())

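That doesn't work either. What could the problem be? As far as I can tell, the file has no mismatched tags, but maybe it's too large? It's only 95.2 KB.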

Answer 1

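You can use bs4 to parse this page, like this:

import urllib.request

import bs4

url = 'http://boards.4chan.org/wsg/thread/629672/i-just-lost-my-marauder-on-eve-i-need-a-ylyl'
# Route the request through an HTTP proxy; adjust or drop this for your own network.
proxies = {'http': 'http://www-proxy.ericsson.se:8080'}
opener = urllib.request.build_opener(urllib.request.ProxyHandler(proxies))
with opener.open(url) as f:
    info = f.read()
# Name the parser explicitly; 'html.parser' ships with Python.
soup = bs4.BeautifulSoup(info, 'html.parser')
print(soup.a)  # the first <a> element in the document

You can download bs4 from this link.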
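As a rough illustration of why an HTML-aware parser helps here, the sketch below feeds the same markup to a strict XML parser and to a tolerant HTML parser. It rests on several assumptions: the `SAMPLE` string is a made-up page (not the one from the question), the helper names `strict_parse_error`, `LinkCollector`, and `collect_links` are hypothetical, and the standard library's `html.parser` stands in for bs4 so that the example runs without installing anything.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# Hypothetical sample: acceptable HTML, but <meta> is never closed,
# so the document is not well-formed XML.
SAMPLE = (
    '<html><head><meta charset="utf-8"></head>'
    '<body class="x"><a href="/">home</a></body></html>'
)


def strict_parse_error(markup):
    """Return the XML parser's error message, or None if the markup parses."""
    try:
        ET.fromstring(markup)
    except ET.ParseError as err:
        return str(err)
    return None


class LinkCollector(HTMLParser):
    """Tolerant HTML parser that records the href of every <a> start tag."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.links.extend(value for name, value in attrs if name == "href")


def collect_links(markup):
    """Feed the markup to LinkCollector and return the collected hrefs."""
    collector = LinkCollector()
    collector.feed(markup)
    collector.close()
    return collector.links


if __name__ == "__main__":
    print(strict_parse_error(SAMPLE))  # an error message mentioning a mismatched tag
    print(collect_links(SAMPLE))       # the HTML parser still finds the link
```

In the sample, the XML parser sees `</head>` while `<meta>` is still open and reports a mismatched tag, whereas the HTML parser treats `<meta>` as a tag that needs no closing counterpart. Whether the same thing happens on the page in the question depends on its actual markup.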

Answer 2

According to the urllib and ElementTree documentation, this snippet appears to work without errors for your example URL.

import urllib.request
import xml.etree.ElementTree as ET

with urllib.request.urlopen('http://boards.4chan.org/wsg/thread/629672/i-just-lost-my-marauder-on-eve-i-need-a-ylyl') as response:
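    html = response.read()
    # response.read() returns bytes; ET.parse() expects a file name or file object,
    # so parse the in-memory data with ET.fromstring() instead.
    tree = ET.fromstring(html)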

If you'd rather not read the response into a variable before parsing it with ElementTree, this also works:

with urllib.request.urlopen('http://boards.4chan.org/wsg/thread/629672/i-just-lost-my-marauder-on-eve-i-need-a-ylyl') as response:
    # ET.parse() accepts a file-like object, so pass the response directly.
    tree = ET.parse(response)
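One detail worth keeping in mind with ElementTree: `ET.parse` expects a file name or a file-like object, while `ET.fromstring` takes data that is already in memory. The following is a minimal sketch of both entry points that avoids the network entirely. The `PAYLOAD` bytes and the helper names are hypothetical, and `io.BytesIO` is only a stand-in for the HTTP response object.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical payload standing in for the bytes returned by response.read().
PAYLOAD = b"<root><item>one</item><item>two</item></root>"


def parse_from_bytes(data):
    """Parse XML already held in memory; returns the root Element."""
    return ET.fromstring(data)


def parse_from_file_object(stream):
    """Parse XML from a file-like object; ET.parse returns an ElementTree."""
    return ET.parse(stream).getroot()


def error_position(data):
    """Return the (line, column) of the first parse error, or None if the data parses."""
    try:
        ET.fromstring(data)
    except ET.ParseError as err:
        return err.position
    return None


if __name__ == "__main__":
    root_a = parse_from_bytes(PAYLOAD)
    root_b = parse_from_file_object(io.BytesIO(PAYLOAD))
    print([item.text for item in root_a], [item.text for item in root_b])
    # <item> is never closed, so the parser reports where it gave up.
    print(error_position(b"<root><item>one</root>"))
```

Catching `ET.ParseError` and reading its `position` attribute gives the same line and column that appear in the error message, which can make it easier to inspect the offending part of a long single-line document.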

Mismatched tag error when parsing XML?
