How to make BeautifulSoup "understand" the &plus; HTML entity

Time: 2019-04-12 11:11:17

Tags: python beautifulsoup

Suppose we have an HTML file like this:

test.html

<div>
<i>Some text here.</i>
Some text here also.<br>
2 &plus; 4 = 6<br>
2 &lt; 4 = True
</div>

If I pass this HTML to BeautifulSoup, it does not recognize the &plus; entity. Instead it escapes the leading & sign, and the output HTML looks like this:

<div>
<i>Some text here.</i>
Some text here also.<br>
2 &amp;plus 4 = 6<br>
2 &lt; 4 = True
</div>

Sample Python 3 code:

from bs4 import BeautifulSoup

with open('test.html', 'rb') as file:
    soup = BeautifulSoup(file, 'html.parser')

print(soup)

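How can I avoid this behavior?

1 answer:

Answer 0 (score: 3)

Read the descriptions of the different parser libraries: https://www.crummy.com/software/BeautifulSoup/bs4/doc/#installing-a-parser

Switching from html.parser to the html5lib parser (a separate package, installed with pip install html5lib) solves your problem: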

from bs4 import BeautifulSoup

s = '''
<div>
<i>Some text here.</i>
Some text here also.<br>
2 &plus; 4 = 6<br>
2 &lt; 4 = True
</div>'''

soup = BeautifulSoup(s, 'html5lib')

You get the following (note that html5lib also wraps the fragment in <html>, <head>, and <body>):

>>> soup

<html><head></head><body><div>
<i>Some text here.</i>
Some text here also.<br/>
2 + 4 = 6<br/>
2 &lt; 4 = True
</div></body></html>
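
If you would rather stay with html.parser, one possible workaround is to expand the problematic entities before parsing. The sketch below uses only the standard library and rests on one fact: &plus; is in Python's HTML5 entity table (html.entities.html5) but not in the HTML 4.01 table (html.entities.name2codepoint). The helper name expand_html5_only_entities is hypothetical, not part of BeautifulSoup, and whether your BeautifulSoup version actually needs this is an assumption you should check against your own output.

```python
import re
from html.entities import html5, name2codepoint

# '&plus;' is defined in the HTML5 entity table but not in the HTML 4.01 one.
print('plus;' in html5)          # True
print('plus' in name2codepoint)  # False

# Matches named references such as &plus; or &lt; (numeric ones are ignored).
_ENTITY = re.compile(r'&([A-Za-z][A-Za-z0-9]*);')


def expand_html5_only_entities(markup):
    """Hypothetical helper: expand named entities that exist only in HTML5.

    HTML 4.01 entities such as &lt; and &amp; are left untouched so the
    markup stays valid; unknown names are also left as they are.
    """
    def repl(match):
        name = match.group(1)
        if name in name2codepoint:
            return match.group(0)  # keep HTML 4.01 entities escaped
        return html5.get(name + ';', match.group(0))

    return _ENTITY.sub(repl, markup)


fixed = expand_html5_only_entities('2 &plus; 4 = 6<br>2 &lt; 4 = True')
print(fixed)  # 2 + 4 = 6<br>2 &lt; 4 = True
```

The result can then be passed to BeautifulSoup(fixed, 'html.parser') as before. Be aware that this also expands other HTML5-only names such as &apos;, which may matter inside quoted attribute values, so treat it as a sketch rather than a general solution.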