Question

My code

from BeautifulSoup import BeautifulSoup

htmls = '''
<div class="main-content">
<h1 class="student">
    <p>Name: <br />
    Alex</p>
    <p>&nbsp;</p>
</h1>
</div>
<div class="department">
... more text
</div>
'''
soup = BeautifulSoup(htmls)
h1 = soup.find("h1", {"class": "student"})
print h1

Expected result

<h1 class="student">
    <p>Name: <br />
    Alex</p>
    <p>&nbsp;</p>
</h1>

Unfortunately, it returns

<h1 class="student">
</h1>

My question is: why does it eat everything inside the p tags? Is it performing renderContents()? Or did the parsing fail?

Answer 1

This is because you are using p tags inside the h1 tag. For example, if you do:

>>> htmls
'\n<div class="main-content">\n<h1 class="student">\n    <p>Name: <br />\n    Alex</p>\n    <p>&nbsp;</p>\n</h1>\n</div>\n<div class="department">\n... more text\n</div>\n'
>>> soup = BeautifulSoup(htmls)
>>> soup

<div class="main-content">
<h1 class="student">
</h1><p>Name: <br />
    Alex</p>
<p>&nbsp;</p>

</div>
<div class="department">
... more text
</div>

You can see that Beautiful Soup parses it slightly differently: the <h1> is closed before the first <p>, so the p tags end up as its siblings rather than its children.

However, if you use span instead:

>>> htmls = '''
... <div class="main-content">
... <h1 class="student">
...     <span>Name: <br />
...     Alex</span>
...     <span>&nbsp;</span>
... </h1>
... </div>
... <div class="department">
... ... more text
... </div>
... '''
>>>
>>> htmls.contents
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'str' object has no attribute 'contents'
>>> soup = BeautifulSoup(htmls)
>>> h1 = soup.find("h1", {"class": "student"})
>>>
>>> h1
<h1 class="student">
    <span>Name: <br />
    Alex</span>
    <span>&nbsp;</span>
</h1>

you can see the children.

This is how the HTML h1 tag behaves, hence the issue. (For details, read about block level elements.)

Answer 2

Try passing a different parser to BeautifulSoup:

pip install html5lib

>>> htmls = '''
... <div class="main-content">
... <h1 class="student">
...     <p>Name: <br />
...     Alex</p>
...     <p>&nbsp;</p>
... </h1>
... </div>
... <div class="department">
... ... more text
... </div>
... '''

>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(htmls, 'html5lib')
>>> h1 = soup.find('h1', 'student')
>>> print h1
<h1 class="student">
    <p>Name: <br/>
    Alex</p>
    <p> </p>
</h1>

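I think this does what you want. Otherwise, you should not put block elements inside elements that only accept inline content, such as h1.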

See http://www.crummy.com/software/BeautifulSoup/bs4/doc/ for how to plug in a parser.
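
If changing the parser is not an option, it may help to detect the problematic nesting before parsing. The following is only a rough sketch, written for Python 3 with the standard library's html.parser rather than BeautifulSoup. The names HEADINGS, BLOCK_TAGS, HeadingNestingChecker and find_block_in_heading are made up for this illustration, and the tag sets are a small assumed subset, not a full HTML content model.

```python
from html.parser import HTMLParser

# Assumption: a small illustrative subset of tags, not the full HTML spec.
HEADINGS = {"h1", "h2", "h3", "h4", "h5", "h6"}
BLOCK_TAGS = {"p", "div", "ul", "ol", "table", "blockquote", "pre"}


class HeadingNestingChecker(HTMLParser):
    """Record block-level tags that are opened while a heading is still open."""

    def __init__(self):
        super().__init__()
        self.open_headings = []  # stack of heading tags currently open
        self.problems = []       # (heading, block_tag) pairs found so far

    def handle_starttag(self, tag, attrs):
        # HTMLParser passes tag names in lowercase.
        if tag in HEADINGS:
            self.open_headings.append(tag)
        elif tag in BLOCK_TAGS and self.open_headings:
            self.problems.append((self.open_headings[-1], tag))

    def handle_endtag(self, tag):
        # Only pop when the innermost open heading is being closed.
        if self.open_headings and tag == self.open_headings[-1]:
            self.open_headings.pop()


def find_block_in_heading(html):
    """Return a list of (heading, block_tag) pairs for suspicious nesting."""
    checker = HeadingNestingChecker()
    checker.feed(html)
    checker.close()
    return checker.problems
```

For the markup in the question this reports two ('h1', 'p') pairs, while the span version yields an empty list. It only flags the nesting; it does not repair it.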

BeautifulSoup does not return the h1 correctly
