Beautiful Soup decoding error

Time: 2013-09-24 06:23:47

Tags: python html beautifulsoup

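I am working on a project where I need to parse a website with Beautiful Soup. The site is http://www.manta.com, but when I look for the site's encoding in the metadata of the HTML, nothing is declared. I downloaded the page and tried to parse the HTML locally, but I ran into decoding errors: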

from bs4 import BeautifulSoup

# manta web page downloaded before
html = open('1.html', 'r')
soup = BeautifulSoup(html, 'lxml')

This produces the following stack trace:

Traceback (most recent call last):
  File "E:/Projects/Python/webkit/sample.py", line 10, in <module>
    soup = BeautifulSoup(html, 'lxml')
  File "C:\Python27\lib\site-packages\bs4\__init__.py", line 172, in __init__
    self._feed()
  File "C:\Python27\lib\site-packages\bs4\__init__.py", line 185, in _feed
    self.builder.feed(self.markup)
  File "C:\Python27\lib\site-packages\bs4\builder\_lxml.py", line 195, in feed
    self.parser.close()
  File "parser.pxi", line 1209, in lxml.etree._FeedParser.close (src\lxm\lxml.etree.c:90717)
  File "parsertarget.pxi", line 142, in lxml.etree._TargetParserContext._handleParseResult (src\lxml\lxml.etree.c:100104)
  File "parsertarget.pxi", line 130, in lxml.etree._TargetParserContext._handleParseResult (src\lxml\lxml.etree.c:99927)
  File "lxml.etree.pyx", line 294, in lxml.etree._ExceptionContext._raise_if_stored (src\lxml\lxml.etree.c:9387)
  File "saxparser.pxi", line 259, in lxml.etree._handleSaxData (src\lxml\lxml.etree.c:96065)
UnicodeDecodeError: 'utf8' codec can't decode bytes in position 105-106: invalid data

I tried passing an encoding to the Beautiful Soup constructor:

soup = BeautifulSoup(html, 'lxml', from_encoding="some encoding")

I keep getting the same error.

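Interestingly, if I load the page in a browser (Firefox, for example), change the encoding to UTF-8, and save it, parsing works fine. Any help is much appreciated. Thanks.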
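The fact that re-saving the page as UTF-8 makes the error go away suggests the downloaded file contains bytes that are not valid UTF-8. One way to check is to read the file as raw bytes and look at both the declared charset, if any, and the offset where UTF-8 decoding first fails. The following is only a minimal sketch, assuming Python 3; the helper names `sniff_meta_charset` and `first_invalid_utf8` are hypothetical and not part of Beautiful Soup or lxml.

```python
import re

# Matches charset=... inside a <meta> tag (HTML5 and http-equiv forms).
_META_CHARSET = re.compile(
    rb'<meta[^>]+charset\s*=\s*["\']?\s*([A-Za-z0-9_\-]+)',
    re.IGNORECASE,
)


def sniff_meta_charset(raw, limit=4096):
    """Return the charset declared in a <meta> tag near the top of the page, or None."""
    match = _META_CHARSET.search(raw[:limit])
    return match.group(1).decode('ascii').lower() if match else None


def first_invalid_utf8(raw):
    """Return (start, end) of the first byte range that is not valid UTF-8, or None."""
    try:
        raw.decode('utf-8')
    except UnicodeDecodeError as exc:
        return exc.start, exc.end
    return None
```

With the saved page, this could be used as `raw = open('1.html', 'rb').read()` followed by `sniff_meta_charset(raw)` and `first_invalid_utf8(raw)`. If the first call returns `None` and the second returns an offset, the page declares no charset and is not UTF-8, which would match the symptoms described above.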

1 Answer:

Answer 0: (score: 1)

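Encode the string as UTF-8 before passing it to Beautiful Soup (this assumes html holds the page contents as a string, not the open file object):

soup = BeautifulSoup(html.encode('UTF-8'), 'lxml')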
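A related approach is to decode the raw bytes yourself and hand Beautiful Soup text that is already Unicode, so no encoding guess is needed during parsing. The sketch below assumes Python 3; the function name `decode_html` and the candidate list are hypothetical, and the page's real encoding is not known from the question, so `cp1252` is only a guess for a typical Western-language site.

```python
def decode_html(raw, candidates=('utf-8', 'cp1252')):
    """Decode raw HTML bytes with the first candidate encoding that succeeds.

    Returns a (text, encoding_used) tuple.
    """
    for encoding in candidates:
        try:
            return raw.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    # Last resort: latin-1 maps every byte to a code point, so it never raises.
    return raw.decode('latin-1'), 'latin-1'
```

The file would be opened in binary mode, for example `raw = open('1.html', 'rb').read()`, and the decoded text from `decode_html(raw)` could then be passed as `BeautifulSoup(text, 'lxml')`. Trying UTF-8 first is deliberate: it is strict enough that text in another encoding rarely decodes by accident, whereas single-byte encodings accept almost anything.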