BeautifulSoup does not return the contents of h1 correctly

Date: 2013-11-25 17:57:55

Tags: python python-2.7 beautifulsoup

My code:

from BeautifulSoup import BeautifulSoup

htmls = '''
<div class="main-content">
<h1 class="student">
    <p>Name: <br />
    Alex</p>
    <p>&nbsp;</p>
</h1>
</div>
<div class="department">
... more text
</div>
'''
soup = BeautifulSoup(htmls)
h1 = soup.find("h1", {"class": "student"})
print h1

Expected result:

<h1 class="student">
    <p>Name: <br />
    Alex</p>
    <p>&nbsp;</p>
</h1>

Unfortunately, it returns:

<h1 class="student">
</h1>

My question is: why does it eat everything between the p tags? Is it doing a renderContents()? Or is the parse failing?

2 answers:

Answer 0 (score: 1)

This happens because you are nesting p tags inside an h1 tag. For example, if you do:

>>> htmls
'\n<div class="main-content">\n<h1 class="student">\n    <p>Name: <br />\n    Alex</p>\n    <p>&nbsp;</p>\n</h1>\n</div>\n<div class="department">\n... more text\n</div>\n'
>>> soup = BeautifulSoup(htmls)
>>> soup

<div class="main-content">
<h1 class="student">
</h1><p>Name: <br />
    Alex</p>
<p>&nbsp;</p>

</div>
<div class="department">
... more text
</div>

You can see that Beautiful Soup parses it slightly differently: the h1 is closed before the first p, so the p tags end up as siblings of h1 rather than as its children.

However, if you use span instead:

>>> htmls = '''
... <div class="main-content">
... <h1 class="student">
...     <span>Name: <br />
...     Alex</span>
...     <span>&nbsp;</span>
... </h1>
... </div>
... <div class="department">
... ... more text
... </div>
... '''
>>>
>>> htmls.contents
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'str' object has no attribute 'contents'
>>> soup = BeautifulSoup(htmls)
>>> h1 = soup.find("h1", {"class": "student"})
>>>
>>> h1
<h1 class="student">
    <span>Name: <br />
    Alex</span>
    <span>&nbsp;</span>
</h1>

you can see the children.

This is how the h1 tag behaves, hence the issue. (For details, read about block-level elements.)
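
The rule behind this is that a heading such as h1 is meant to hold only inline content, so a p nested inside it is invalid markup that a parser may repair by closing the heading early. Below is a minimal sketch, for Python 3 and using only the standard library, of how one might spot that kind of nesting before handing the markup to BeautifulSoup. The names NestingChecker and find_block_in_headings, and the deliberately incomplete tag sets, are illustrative assumptions and not part of BeautifulSoup.

```python
from html.parser import HTMLParser

# Illustrative, incomplete tag sets (assumptions, not taken from BeautifulSoup).
HEADING_TAGS = {"h1", "h2", "h3", "h4", "h5", "h6"}
BLOCK_TAGS = {"p", "div", "ul", "ol", "table", "blockquote", "pre"}
VOID_TAGS = {"br", "hr", "img", "input", "meta", "link"}


class NestingChecker(HTMLParser):
    """Record block-level tags that are opened while a heading is still open."""

    def __init__(self):
        super().__init__()
        self.open_tags = []  # stack of currently open tag names
        self.problems = []   # (heading tag, block tag) pairs found

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            for open_tag in self.open_tags:
                if open_tag in HEADING_TAGS:
                    self.problems.append((open_tag, tag))
                    break
        # Void elements such as <br> never get a closing tag, so skip them.
        if tag not in VOID_TAGS:
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the most recent matching open tag, if there is one.
        if tag in self.open_tags:
            while self.open_tags.pop() != tag:
                pass


def find_block_in_headings(html):
    """Return a list of (heading, block) pairs for suspicious nesting."""
    checker = NestingChecker()
    checker.feed(html)
    checker.close()
    return checker.problems


print(find_block_in_headings('<h1 class="student"><p>Name</p></h1>'))
# [('h1', 'p')]
```

An empty list means no block-level tag from the illustrative set was found inside a heading, which is the case for the span version above.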

Answer 1 (score: 1)

Try passing a different parser to BeautifulSoup:

pip install html5lib

>>> from bs4 import BeautifulSoup
>>> htmls = '''
... <div class="main-content">
... <h1 class="student">
...     <p>Name: <br />
...     Alex</p>
...     <p>&nbsp;</p>
... </h1>
... </div>
... <div class="department">
... ... more text
... </div>
... '''

>>> soup = BeautifulSoup(htmls, 'html5lib')
>>> h1 = soup.find('h1', 'student')
>>> print h1
<h1 class="student">
    <p>Name: <br/>
    Alex</p>
    <p> </p>
</h1>

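I think this is what you want. Otherwise, you should not put block-level elements such as p inside an element like h1, which is meant to hold only inline content.

See http://www.crummy.com/software/BeautifulSoup/bs4/doc/ for how to plug in a parser.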
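
To check whether a given parser keeps the p elements inside the h1, the same markup can be parsed with an explicitly named parser. This is a minimal sketch for Python 3, assuming the bs4 package (beautifulsoup4) is installed. It uses the built-in "html.parser" backend so that it runs without html5lib, and the helper name paragraphs_in_h1 is illustrative.

```python
from bs4 import BeautifulSoup

HTML = '''
<div class="main-content">
<h1 class="student">
    <p>Name: <br />
    Alex</p>
    <p>&nbsp;</p>
</h1>
</div>
'''


def paragraphs_in_h1(markup, parser="html.parser"):
    """Return the p elements found inside the first h1 with class "student"."""
    soup = BeautifulSoup(markup, parser)
    h1 = soup.find("h1", class_="student")
    return h1.find_all("p") if h1 is not None else []


# Number of p elements that this parser kept inside the h1.
print(len(paragraphs_in_h1(HTML)))
```

Passing "html5lib" as the second argument, as in the answer, should work the same way once html5lib is installed. Different parsers can repair invalid markup differently, so it is worth checking the result for the parser you actually use.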