Inconsistent behavior of BeautifulSoup

Asked: 2015-09-18 05:21:17

Tags: python python-2.7 web-scraping beautifulsoup html-parsing

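I'm completely baffled by the behavior of the following HTML-scraping code, which I ran in two different environments. I need help finding the root cause of the difference.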

import sys
import bs4
import md5
import logging
from urllib2 import urlopen
from platform import platform

# Log particulars of the environment
logging.warning("OS platform is %s" %platform())
logging.warning("Python version is %s" %sys.version)
logging.warning("BeautifulSoup is at %s and its version is %s" %(bs4.__file__, bs4.__version__))

# Open web-page and read HTML
url = 'http://www.ncbi.nlm.nih.gov/Traces/wgs/?val=JXIG&size=all'
response = urlopen(url)
html = response.read()

# Calculate MD5 to ensure that the same string was downloaded
print "MD5 sum for html string downloaded is %s" %md5.new(html).hexdigest()

# Make beautiful soup
soup = bs4.BeautifulSoup(html, 'html')
contigsTable = soup.find("table", {"class" : "zebra"})
contigs = []

# Parse table in soup to find all records
for row in contigsTable.findAll('tr'):
    column = row.findAll('td')
    if len(column) > 2:
        contigs.append(column[1])

# Expect identical results on any machine that this is run
print "Number of contigs identified is %s" %len(contigs)

On machine 1, the run returns:

WARNING:root:OS platform is Linux-3.10.10-031010-generic-x86_64-with-Ubuntu-12.04-precise   
WARNING:root:Python version is 2.7.3 (default, Jun 22 2015, 19:33:41)  
[GCC 4.6.3]  
WARNING:root:BeautifulSoup is at /usr/local/lib/python2.7/dist-packages/bs4/__init__.pyc and its version is 4.3.2  
MD5 sum for html string downloaded is ca76b381df706a2d6443dd76c9d27adf  

Number of contigs identified is 630  

On machine 2, this exact same code returns:

WARNING:root:OS platform is Linux-2.6.32-431.46.2.el6.nersc.x86_64-x86_64-with-debian-6.0.6
WARNING:root:Python version is 2.7.4 (default, Apr 17 2013, 10:26:13) 
[GCC 4.6.3]
WARNING:root:BeautifulSoup is at /global/homes/i/img/.local/lib/python2.7/site-packages/bs4/__init__.pyc and its version is 4.3.2
MD5 sum for html string downloaded is ca76b381df706a2d6443dd76c9d27adf

Number of contigs identified is 462

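The number of contigs counted differs. Note that the same code parses the same HTML table (the MD5 sums match) yet produces different results in two environments that are not obviously different from each other, which has unfortunately turned into a production nightmare. Manual inspection confirms that the result returned on machine 2 is incorrect, but so far I can't explain why.

Has anyone had a similar experience? Do you notice anything wrong with this code, or should I stop trusting BeautifulSoup altogether?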

1 Answer:

Answer 0 (score: 4)

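You are running into the differences between parsers that BeautifulSoup chooses automatically for the "html" markup type you specified. Which parser gets picked depends on which modules are available in the current Python environment: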

  

If you don't specify anything, you'll get the best HTML parser that's installed. Beautiful Soup ranks lxml's parser as being the best, then html5lib's, then Python's built-in parser.
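A rough way to see what "best installed parser" would resolve to on a given machine is to probe for the optional packages in the order quoted above. This is only a sketch: `available_parsers` and `PARSER_PREFERENCE` are made-up names, not part of Beautiful Soup, and the code mirrors the documented ranking rather than calling into bs4's own lookup.

```python
import importlib

# Preference order as quoted from the Beautiful Soup docs above.
# Each entry is (parser name passed to BeautifulSoup, module that must import).
PARSER_PREFERENCE = [
    ("lxml", "lxml"),
    ("html5lib", "html5lib"),
    ("html.parser", None),  # built into Python, always available
]


def available_parsers():
    """Return the parser names usable in this environment, best-ranked first."""
    found = []
    for parser_name, module_name in PARSER_PREFERENCE:
        if module_name is None:
            found.append(parser_name)
            continue
        try:
            importlib.import_module(module_name)
        except ImportError:
            continue  # optional package is not installed on this machine
        found.append(parser_name)
    return found
```

Calling `available_parsers()` on both machines and comparing the first entry should show whether one of them has lxml or html5lib installed and the other does not.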

To get consistent behavior across platforms, be explicit:

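from bs4 import BeautifulSoup

# Pick exactly one of these and use the same one on every machine:
soup = BeautifulSoup(html, "html.parser")  # Python's built-in parser, nothing extra to install
soup = BeautifulSoup(html, "html5lib")     # requires the html5lib package
soup = BeautifulSoup(html, "lxml")         # requires the lxml package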
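To confirm that the choice actually took effect, one option is to log the tree builder Beautiful Soup ended up with, alongside the other environment details in the question. The sketch below assumes bs4 is installed and uses Python's built-in parser. `make_soup` is a hypothetical helper name, and `soup.builder.NAME` is, as far as I know, the attribute bs4's tree builders use to carry their parser name.

```python
import logging

from bs4 import BeautifulSoup


def make_soup(html, parser="html.parser"):
    """Parse html with an explicitly named parser and log which builder was used."""
    soup = BeautifulSoup(html, parser)
    logging.warning("BeautifulSoup tree builder in use: %s", soup.builder.NAME)
    return soup


# Invalid markup is where parsers disagree: each one repairs this fragment
# in its own way, which can lead to different element counts downstream.
soup = make_soup("<a></p>")
```

If the logged name differs between two machines running the same script, the parser was not pinned on at least one of them.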

See also: Installing a parser