Why does this BeautifulSoup parsing call raise an error?

Asked: 2015-01-08 13:39:05

Tags: python beautifulsoup html-parsing

I have this HTML source: http://pastebin.com/itMYaimq. I am running the following BeautifulSoup call to parse the HTML:

def check_img(self, feed):
        return 1 if feed.find_all('img', attrs={'data-blzsrc': True, 'src': lambda x: 'data' not in x}) else 0

Here feed is the HTML source.

Running it raises the following:

[2015-01-08 10:19:16,415: WARNING/Worker-2] Traceback (most recent call last):
[2015-01-08 10:19:16,415: WARNING/Worker-2] File "/Users/rokumar/SiteAnalysisGit/Src/hct/hct/data_processors/rule_processor.py", line 58, in do_akamai_analysis
[2015-01-08 10:19:16,416: WARNING/Worker-2] resp, self.analysis.url, self.analysis.id)
[2015-01-08 10:19:16,416: WARNING/Worker-2] File "/Users/rokumar/SiteAnalysisGit/Src/hct/hct/rules.py", line 794, in akamai_rule_analysis
[2015-01-08 10:19:16,416: WARNING/Worker-2] result[RULES.FEO_CHECKS] = check_feo_optimizations(analysis_id, url)
[2015-01-08 10:19:16,417: WARNING/Worker-2] File "/Users/rokumar/SiteAnalysisGit/Src/hct/hct/rules.py", line 1320, in check_feo_optimizations
[2015-01-08 10:19:16,417: WARNING/Worker-2] return FEO_processor.FEOProcessor().process_feo_debug_output(analysis_id, url)
[2015-01-08 10:19:16,417: WARNING/Worker-2] File "/Users/rokumar/SiteAnalysisGit/Src/hct/hct/data_processors/FEO_processor.py", line 38, in process_feo_debug_output
[2015-01-08 10:19:16,417: WARNING/Worker-2] self.result[name] = (False, True)[getattr(self,func)(feed)]
[2015-01-08 10:19:16,418: WARNING/Worker-2] File "/Users/rokumar/SiteAnalysisGit/Src/hct/hct/data_processors/FEO_processor.py", line 64, in check_img
[2015-01-08 10:19:16,418: WARNING/Worker-2] return 1 if feed.find_all('img', attrs={'data-blzsrc': True, 'src': lambda x: 'data' not in x}) else 0
[2015-01-08 10:19:16,418: WARNING/Worker-2] File "/Library/Python/2.7/site-packages/bs4/element.py", line 1180, in find_all
[2015-01-08 10:19:16,419: WARNING/Worker-2] return self._find_all(name, attrs, text, limit, generator, **kwargs)
[2015-01-08 10:19:16,419: WARNING/Worker-2] File "/Library/Python/2.7/site-packages/bs4/element.py", line 505, in _find_all
[2015-01-08 10:19:16,419: WARNING/Worker-2] found = strainer.search(i)
[2015-01-08 10:19:16,420: WARNING/Worker-2] File "/Library/Python/2.7/site-packages/bs4/element.py", line 1540, in search
[2015-01-08 10:19:16,420: WARNING/Worker-2] found = self.search_tag(markup)
[2015-01-08 10:19:16,420: WARNING/Worker-2] File "/Library/Python/2.7/site-packages/bs4/element.py", line 1512, in search_tag
[2015-01-08 10:19:16,421: WARNING/Worker-2] if not self._matches(attr_value, match_against):
[2015-01-08 10:19:16,421: WARNING/Worker-2] File "/Library/Python/2.7/site-packages/bs4/element.py", line 1578, in _matches
[2015-01-08 10:19:16,421: WARNING/Worker-2] return match_against(markup)
[2015-01-08 10:19:16,421: WARNING/Worker-2] File "/Users/rokumar/SiteAnalysisGit/Src/hct/hct/data_processors/FEO_processor.py", line 64, in <lambda>
[2015-01-08 10:19:16,422: WARNING/Worker-2] return 1 if feed.find_all('img', attrs={'data-blzsrc': True, 'src': lambda x: 'data' not in x}) else 0
[2015-01-08 10:19:16,422: WARNING/Worker-2] TypeError: argument of type 'NoneType' is not iterable

I printed feed to check its value. It printed the HTML source, so it is not None. So why am I getting the error argument of type 'NoneType' is not iterable?

1 Answer:

Answer 0 (score: 2)

Your src lambda is being called with None:

>>> x = None
>>> 'data' not in x
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: argument of type 'NoneType' is not iterable

This happens when the filter is applied to an <img> tag that has no src attribute; your input source has 8 such tags:

>>> import requests
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(requests.get('http://pastebin.com/raw.php?i=itMYaimq').content)
>>> len(soup.find_all('img', src=False))
8
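
For an offline check that does not depend on the pastebin page, here is a minimal sketch using a made-up HTML snippet (the markup and file names are assumptions for illustration, not taken from the question's source). It counts the <img> tags for which tag.get('src') is None, which is the kind of value the failing lambda received:

```python
from bs4 import BeautifulSoup

# Hypothetical markup: two of the three <img> tags have no src attribute.
html = (
    '<img data-blzsrc="a.png">'
    '<img data-blzsrc="b.png" src="http://example.com/b.png">'
    '<img alt="spacer">'
)
soup = BeautifulSoup(html, 'html.parser')

# Tag.get() returns None when the attribute is missing.
missing_src = [img for img in soup.find_all('img') if img.get('src') is None]
print(len(missing_src))
```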

Just guard against None in the lambda:

lambda x: x and 'data' not in x
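
To see the guard in isolation, the sketch below writes the same test as a named function (the name src_without_data and the sample values are hypothetical) and calls it with the three kinds of value a src filter may see. The bool() call is an addition so that the function always returns True or False rather than the original falsy value:

```python
def src_without_data(src):
    # src is None when the <img> tag has no src attribute at all.
    # The truthiness check short-circuits, so 'in' never runs on None.
    return bool(src) and 'data' not in src

print(src_without_data(None))                          # missing attribute
print(src_without_data('data:image/gif;base64,R0lG'))  # inline data URI
print(src_without_data('http://example.com/a.png'))    # ordinary URL
```

Only the last call prints True; the first two print False, and none of them raises TypeError.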

Your test can also be simplified; there is no need to find all matches, only the first one:

blzsrc_image = feed.find('img', attrs={'data-blzsrc': True, 'src': lambda x: x and 'data' not in x})
return 1 if blzsrc_image else 0

If a boolean is acceptable (instead of 1 or 0), you can use:

blzsrc_image = feed.find('img', attrs={'data-blzsrc': True, 'src': lambda x: x and 'data' not in x})
return blzsrc_image is not None
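
Putting the pieces together, a self-contained version of the check might look like the sketch below. The function name has_blzsrc_img and the sample HTML strings are assumptions for illustration; the filter itself is the guarded one from this answer, passed to find() so the search stops at the first match:

```python
from bs4 import BeautifulSoup

def has_blzsrc_img(feed):
    """Return True if some <img> has data-blzsrc and a src without 'data'."""
    match = feed.find(
        'img',
        attrs={
            'data-blzsrc': True,
            # Guard first: the value may be None for tags without src.
            'src': lambda value: bool(value) and 'data' not in value,
        },
    )
    return match is not None

# Hypothetical inputs covering the three cases.
no_src = BeautifulSoup('<img data-blzsrc="a.png"><img src="b.png">', 'html.parser')
plain_src = BeautifulSoup(
    '<img data-blzsrc="a.png" src="http://example.com/a.png">', 'html.parser')
data_src = BeautifulSoup(
    '<img data-blzsrc="a.png" src="data:image/gif;base64,R0lG">', 'html.parser')

print(has_blzsrc_img(no_src), has_blzsrc_img(plain_src), has_blzsrc_img(data_src))
```

Only the second document contains a matching tag; the first has no <img> carrying both attributes, and the third has a src that contains 'data'.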