My code:
from BeautifulSoup import BeautifulSoup
htmls = '''
<div class="main-content">
<h1 class="student">
<p>Name: <br />
Alex</p>
<p> </p>
</h1>
</div>
<div class="department">
... more text
</div>
'''
soup = BeautifulSoup(htmls)
h1 = soup.find("h1", {"class": "student"})
print h1
Expected result:
<h1 class="student">
<p>Name: <br />
Alex</p>
<p> </p>
</h1>
Unfortunately, it returns:
<h1 class="student">
</h1>
My question is: why does it swallow everything between the p tags? Is it calling renderContents()? Or is the parse failing?
Answer 0 (score: 1)
This happens because you are using a p tag inside the h1 tag. For example, if you do:
>>> htmls
'\n<div class="main-content">\n<h1 class="student">\n <p>Name: <br />\n Alex</p>\n <p> </p>\n</h1>\n</div>\n<div class="department">\n... more text\n</div>\n'
>>> soup = BeautifulSoup(htmls)
>>> soup
<div class="main-content">
<h1 class="student">
</h1><p>Name: <br />
Alex</p>
<p> </p>
</div>
<div class="department">
... more text
</div>
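You can see that Beautiful Soup parses it slightly differently: the h1 is closed before the first p opens, so the p elements end up as siblings of the h1 rather than its children.
However, if you use span instead of p, you can see the children:
>>> htmls = '''
... <div class="main-content">
... <h1 class="student">
... <span>Name: <br />
... Alex</span>
... <span> </span>
... </h1>
... </div>
... <div class="department">
... ... more text
... </div>
... '''
>>>
>>> htmls.contents
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'str' object has no attribute 'contents'
>>> soup = BeautifulSoup(htmls)
>>> h1 = soup.find("h1", {"class": "student"})
>>>
>>> h1
<h1 class="student">
<span>Name: <br />
Alex</span>
<span> </span>
</h1>
This follows from how the HTML h1 tag is meant to be used: it is not supposed to contain block-level elements such as p, hence the problem. (For details, read about block-level elements.)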
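To check in advance whether a document contains this kind of nesting, a small standard-library sketch can flag block-level tags opened inside a heading. It does not use BeautifulSoup and does not reproduce how BeautifulSoup decides where to close tags. The names HeadingNestingChecker and find_block_in_heading are hypothetical, the tag sets are a small illustrative subset, and the code assumes Python 3.

```python
from html.parser import HTMLParser

# Illustrative subsets only (assumption): not complete lists of HTML tags.
BLOCK_TAGS = {"p", "div", "ul", "ol", "table", "pre", "form", "blockquote"}
HEADING_TAGS = {"h1", "h2", "h3", "h4", "h5", "h6"}
VOID_TAGS = {"br", "hr", "img", "input", "meta", "link"}


class HeadingNestingChecker(HTMLParser):
    """Hypothetical helper: records block-level tags opened inside a heading."""

    def __init__(self):
        super().__init__()
        self.open_tags = []  # stack of currently open tag names
        self.problems = []   # (heading, block tag) pairs, in document order

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            for ancestor in self.open_tags:
                if ancestor in HEADING_TAGS:
                    self.problems.append((ancestor, tag))
        # Void tags such as <br /> never get a closing tag, so do not track them.
        if tag not in VOID_TAGS:
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the most recent matching open tag, if there is one.
        if tag in self.open_tags:
            while self.open_tags.pop() != tag:
                pass


def find_block_in_heading(markup):
    """Return (heading, block tag) pairs found in the raw markup."""
    checker = HeadingNestingChecker()
    checker.feed(markup)
    checker.close()
    return checker.problems


print(find_block_in_heading('<h1 class="student"><p>Name: <br />Alex</p></h1>'))
print(find_block_in_heading('<h1 class="student"><span>Name</span></h1>'))
```

The first call reports the p nested in h1, and the second reports nothing, which matches the difference between the two inputs shown above.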
Answer 1 (score: 1)
Try passing a different parser to BeautifulSoup:
pip install html5lib
>>> from bs4 import BeautifulSoup
>>> htmls = '''
... <div class="main-content">
... <h1 class="student">
... <p>Name: <br />
... Alex</p>
... <p> </p>
... </h1>
... </div>
... <div class="department">
... ... more text
... </div>
... '''
>>> soup = BeautifulSoup(htmls, 'html5lib')
>>> h1 = soup.find('h1', 'student')
>>> print h1
<h1 class="student">
<p>Name: <br/>
Alex</p>
<p> </p>
</h1>
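I think that is what you wanted. Otherwise, you should not put block-level elements inside a heading such as h1.
See http://www.crummy.com/software/BeautifulSoup/bs4/doc/ for how to plug in a different parser.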
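The parser choice can also be wrapped in a small helper. The following is a minimal sketch, assuming bs4 on Python 3: it asks for html5lib and falls back to the built-in html.parser when html5lib is not installed. make_soup is a hypothetical name, not part of BeautifulSoup, and the expectation that both parsers keep p nested inside h1 for this input should be checked against your installed versions.

```python
from bs4 import BeautifulSoup, FeatureNotFound

HTMLS = """
<div class="main-content">
<h1 class="student">
<p>Name: <br />
Alex</p>
<p> </p>
</h1>
</div>
<div class="department">
... more text
</div>
"""


def make_soup(markup):
    """Hypothetical helper: prefer html5lib, fall back to html.parser."""
    try:
        return BeautifulSoup(markup, "html5lib")
    except FeatureNotFound:
        # bs4 raises FeatureNotFound when the requested parser is missing.
        return BeautifulSoup(markup, "html.parser")


soup = make_soup(HTMLS)
h1 = soup.find("h1", class_="student")
paragraphs = h1.find_all("p")  # p elements still nested inside the h1
print(len(paragraphs))
print(h1)
```

Naming the parser explicitly also keeps the result from depending on whichever parser happens to be installed on a given machine.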