Beautiful Soup and uTidy

Date: 2009-05-20 06:05:39

Tags: python screen-scraping beautifulsoup tidy

I want to pass the output of uTidy to Beautiful Soup, like so:

page = urllib2.urlopen(url)
options = dict(output_xhtml=1,add_xml_decl=0,indent=1,tidy_mark=0)
cleaned_html = tidy.parseString(page.read(), **options)
soup = BeautifulSoup(cleaned_html)

When I run it, I get the following error:

Traceback (most recent call last):
  File "soup.py", line 34, in <module>
    soup = BeautifulSoup(cleaned_html)
  File "/var/lib/python-support/python2.6/BeautifulSoup.py", line 1499, in __init__
    BeautifulStoneSoup.__init__(self, *args, **kwargs)
  File "/var/lib/python-support/python2.6/BeautifulSoup.py", line 1230, in __init__
    self._feed(isHTML=isHTML)
  File "/var/lib/python-support/python2.6/BeautifulSoup.py", line 1245, in _feed
    smartQuotesTo=self.smartQuotesTo, isHTML=isHTML)
  File "/var/lib/python-support/python2.6/BeautifulSoup.py", line 1751, in __init__
    self._detectEncoding(markup, isHTML)
  File "/var/lib/python-support/python2.6/BeautifulSoup.py", line 1899, in _detectEncoding
    xml_encoding_match = re.compile(xml_encoding_re).match(xml_data)
TypeError: expected string or buffer

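I gather that uTidy returns an XML document object, while BeautifulSoup wants a string. Is there a way to cast cleaned_html to a string? Or am I doing this wrong and should take a different approach?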

2 answers:

Answer 0: (score: 11)

Just wrap cleaned_html in str() when you pass it to BeautifulSoup.

Answer 1: (score: 2)

Convert the value you pass to BeautifulSoup into a string. In your case, change the last line to:

soup = BeautifulSoup(str(cleaned_html))
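To see why the str() call matters, here is a minimal, self-contained sketch of the failure. It does not use uTidy or BeautifulSoup: FakeTidyDocument is a hypothetical stand-in for whatever tidy.parseString() returns (assumed to render the cleaned markup through __str__), and detect_xml_encoding is a simplified illustration of a regex-based encoding sniff like the one in the traceback, not BeautifulSoup's actual code.

```python
import re

# Simplified pattern for pulling the encoding out of an XML declaration.
# This is an illustration only, not the regex BeautifulSoup uses.
XML_DECL_RE = re.compile(r"""^<\?xml[^>]*encoding=['"]([^'"]+)['"]""")


class FakeTidyDocument(object):
    """Hypothetical stand-in for the object returned by tidy.parseString()."""

    def __init__(self, markup):
        self._markup = markup

    def __str__(self):
        # Assumption: the real result object renders its markup via str().
        return self._markup


def detect_xml_encoding(markup):
    """Return the declared encoding, or None if there is no declaration."""
    match = XML_DECL_RE.match(markup)
    return match.group(1) if match else None


doc = FakeTidyDocument('<?xml version="1.0" encoding="utf-8"?><html></html>')

try:
    # The document object is not a string, so the regex engine rejects it.
    detect_xml_encoding(doc)
    raised = False
except TypeError:
    raised = True

# Converting first hands the regex plain markup text, which it can match.
encoding = detect_xml_encoding(str(doc))
```

The point is that the regex match only accepts string-like input, so any object that merely knows how to render itself as markup has to be converted explicitly before it is handed over.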