Python BeautifulSoup ASCII error

Time: 2012-05-15 16:16:04

Tags: python ascii beautifulsoup

My script works when I download an English Bible, but when I download a foreign-language Bible it gives me an ASCII error.

from BeautifulSoup import BeautifulSoup, Tag, NavigableString
import lxml.html as html
import urlparse
import os, sys
import urllib2
import re
print ("downloading and converting Bibles to Aurora...")
root = html.parse(open('links.html'))
for link in root.findall('//a'):
  url = link.get('href')
  name = urlparse.urlparse(url).path.split('/')[-1]
  namesave = '%s.html' % '.'.join(name.split('.')[:-1])
  chnum = name.split('.')[-2]
  dirname = urlparse.urlparse(url).path.split('.')[-1]
  try:
      f = urllib2.urlopen(url)
  except urllib2.URLError:
      print "Bad URL or timeout"
      continue
  s = f.read()
  if (os.path.isdir(dirname) == 0): 
    os.mkdir(dirname)
  soup = BeautifulSoup(s)
  thearticle = soup.html.body.article
  bookname = thearticle['data-book-human']
  soup.html.replaceWith('<html>'+str(bookname)+'</html>')
  converted = str(soup)
  full_path = os.path.join(dirname, namesave)
  open(full_path, 'wb').write(converted)
  print(name)
print("DOWNLOADS AND CONVERSIONS COMPLETE!")

The links.html that works

<a href="http://www.youversion.com/bible/john.6.ceb">http://www.youversion.com/bible/john.6.ceb</a>

The links.html that gives the error

<a href="http://www.youversion.com/bible/john.6.nav">http://www.youversion.com/bible/john.6.nav</a>

The error

  File "test.py", line 32, in <module>
    soup.html.replaceWith('<html>'+str(bookname)+'</html>')
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-4: ordinal not in range(128)

1 answer:

Answer 0 (score: 2)

I have seen a similar error before, possibly even the same one. I can't remember exactly.

Try:

BeautifulSoup(s, convertEntities=BeautifulSoup.HTML_ENTITIES)

Or try forcing unicode:

soup.html.replaceWith(u'<html>'+unicode(bookname)+u'</html>')
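
For context, here is a minimal sketch of what is probably going wrong, assuming bookname comes back as a unicode string with non-ASCII characters. In Python 2, str() on such a value implicitly encodes it with the ASCII codec, and that encode step is what raises. The sample value below is hypothetical, not taken from the page. The explicit .encode('ascii') call stands in for the implicit step so the snippet behaves the same on Python 2 and Python 3.

```python
# -*- coding: utf-8 -*-

# Hypothetical non-ASCII value standing in for thearticle['data-book-human'].
bookname = u'\u064a\u0648\u062d\u0646\u0627'

# Python 2's str(bookname) implicitly performs the equivalent of this
# explicit ASCII encode, which fails for characters outside range(128).
try:
    bookname.encode('ascii')
    failed = False
    message = ''
except UnicodeEncodeError as exc:
    failed = True
    message = str(exc)

# Keeping everything as unicode text avoids the implicit ASCII encode.
page = u'<html>' + bookname + u'</html>'

# Encode once, explicitly, to get bytes suitable for a file opened with 'wb'.
encoded = page.encode('utf-8')

print(failed)
print(message)
```

If this is the cause, keeping the markup as unicode and encoding it once, right before writing to the file opened in binary mode, should avoid the error.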