Question

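I am writing a generic html parser and want to be able to extract all the tags inside a given tag. Because the parser is generic, the outer tag may contain one or more inner tags, and they could be any html tag, so I can't use methods like find with a fixed tag name. I also tried .contents, but it returns the result as a list, whereas I just want the tags themselves so that they can be parsed further as bs4 tags.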

For example, given the following html:

<tr><th>a</th><th>b</th></tr>

I need to extract the following, while making sure each piece is still of bs4 tag type:

<th>a</th><th>b</th>

Answer 1

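Why not use the find_all() method without any arguments? Called that way, it returns every tag nested inside the element, at any depth.

from bs4 import BeautifulSoup as soup

html = """<div><tr><th>a</th><th>b</th></tr></div>"""

page = soup(html, "html.parser")

div = page.find('div')

# find_all() with no arguments matches every descendant tag
print('Get all tag occurrences')
print(div.find_all())

# the first match is the outermost nested tag, which already contains the rest
print('Get only the inside tag, without duplicate')
print(div.find_all()[0])

Output:

Get all tag occurrences
[<tr><th>a</th><th>b</th></tr>, <th>a</th>, <th>b</th>]

Get only the inside tag, without duplicate
<tr><th>a</th><th>b</th></tr>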
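If only the direct children are wanted, as in the question's example, one option is to pass recursive=False to find_all(). The following is a minimal sketch, assuming the html.parser backend and the question's sample markup; the variable names (tr, children, tag_children) are only illustrative and not from the original answer.

```python
from bs4 import BeautifulSoup, Tag

html = "<tr><th>a</th><th>b</th></tr>"
soup = BeautifulSoup(html, "html.parser")
tr = soup.find("tr")

# recursive=False restricts the search to direct children; with no name
# filter, find_all() matches every tag and skips plain text nodes.
children = tr.find_all(recursive=False)
print(children)

# Each item is still a bs4 Tag, so it can be parsed further.
print(all(isinstance(child, Tag) for child in children))

# A similar result from .contents, keeping only Tag nodes.
tag_children = [node for node in tr.contents if isinstance(node, Tag)]
print(tag_children)
```

Both approaches give the two th elements as Tag objects, so calls such as child.name or child.get_text() work on each of them. Other parsers (lxml, html5lib) may restructure a bare tr fragment, so the result can differ there.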
