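BeautifulSoup error

Time: 2011-12-20 11:23:35

Tags: python beautifulsoup

I am using the BeautifulSoup module to scrape the titles from a list of web pages saved in a CSV file. The script seems to work fine, but once it reaches the 82nd domain it produces the following error: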

Traceback (most recent call last):
  File "soup.py", line 31, in <module>
    print soup.title.renderContents() # 'Google'
AttributeError: 'NoneType' object has no attribute 'renderContents'

I am quite new to Python, so I am not sure I understand the error. Could someone clarify what is going wrong?

My code is:

import csv
import socket
from urllib2 import Request, urlopen, URLError, HTTPError
from BeautifulSoup import BeautifulSoup

debuglevel = 0

timeout = 5

socket.setdefaulttimeout(timeout) 
domains = csv.reader(open('domainlist.csv'))
f = open ('souput.txt', 'w')
for row in domains:
    domain = row[0]
    req = Request(domain)
    try:
        html = urlopen(req).read()
        print domain
    except HTTPError, e:
        print 'The server couldn\'t fulfill the request.'
        print 'Error code: ', e.code
    except URLError, e:
        print 'We failed to reach a server.'
        print 'Reason: ', e.reason
    else:
        # everything is fine
        soup = BeautifulSoup(html)

        print soup.title # '<title>Google</title>'
        print soup.title.renderContents() # 'Google'
        f.writelines(domain)
        f.writelines("  ")
        f.writelines(soup.title.renderContents())
        f.writelines("\n")

3 answers:

Answer 0 (score: 1)

What if the page has no title? I ran into this problem once. Just wrap the code in a try block, or check for the title first.
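
A minimal sketch of the "wrap it in a try" option, written as a small helper. The name `title_text` and its `default` parameter are made up for illustration, and `soup` is assumed to be the parsed BeautifulSoup object from the question:

```python
def title_text(soup, default=''):
    """Return the contents of the page's <title>, or `default` if there is none."""
    try:
        # soup.title is None when the page has no <title> tag, so calling
        # a method on it raises AttributeError for such pages.
        return soup.title.renderContents()
    except AttributeError:
        return default
```

The loop could then call `title_text(soup)` instead of `soup.title.renderContents()` and carry on past pages without a title. One caveat: catching AttributeError would also hide a misspelled method name, so an explicit check for None is often easier to debug.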

Answer 1 (score: 1)

As maozet said, your problem is that the title is None. You can check the value to avoid this kind of problem:

soup = BeautifulSoup(html)

if soup.title is not None:
    print soup.title # '<title>Google</title>'
    print soup.title.renderContents() # 'Google'
    f.writelines(domain)
    f.writelines("  ")
    f.writelines(soup.title.renderContents())
    f.writelines("\n")
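
If you still want one line per domain in the output file, including pages that have no title, something along these lines might work. This is only a sketch: `format_row` and the `'(no title)'` placeholder are hypothetical names, and it assumes BeautifulSoup 3 on Python 2 as in the question, where `renderContents()` returns a plain string:

```python
def format_row(domain, soup, placeholder='(no title)'):
    """Build one output line: the domain, two spaces, then the title text."""
    title = soup.title
    if title is None:
        # The page has no <title> tag; record a placeholder instead of crashing.
        text = placeholder
    else:
        text = title.renderContents().strip()
    return '%s  %s\n' % (domain, text)
```

Inside the `else:` branch of the loop, the four `writelines` calls could then be replaced by `f.write(format_row(domain, soup))`.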

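Answer 2 (score: 0)

I ran into the same problem, but reading several related questions and some googling got me through it. Here is what I suggest for handling a specific error such as NoneType: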

import urllib2
from bs4 import BeautifulSoup  # assumption: bs4, since get_text() is a BeautifulSoup 4 method

soup = BeautifulSoup(urllib2.urlopen('http://webpage.com').read())
scrapped = soup.find(id='whatweseekfor')

if scrapped is None:
    # the element was not found; handle that case here, e.g.:
    print None
else:
    # the element exists, so its methods are safe to call, e.g.:
    print scrapped.get_text()

Good luck!