Question

I'm using Python 2, and I have the following script:

from bs4 import BeautifulSoup
import requests, re

page = "http://hidden.com/example"
headers = {'User-Agent': 'Craig'}
html = requests.post(page, headers=headers)

soup = BeautifulSoup(html.text, "html.parser")

final = soup.find('p',{'class':'text'})

print final

This runs against a site that I'd rather not post publicly, and it returns this:

<p>Example text <a href="example">Example</a> more example <a href="second example">Second example</a></p>

How do I remove the <p> and <a href=""> tags, along with any other tags that may be lurking?

Answer 1

Most bs4 tags have a .strings attribute, which is a generator over all the strings inside the tag.

print(''.join(final.strings))
# Example text Example more example Second example
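For anyone who wants to try this without the original site, here is a small self-contained sketch. It assumes the markup is exactly the sample shown in the question (hard-coded into a string rather than fetched with requests) and looks up the first <p> without a class filter, since the sample has no class attribute. It also shows get_text(), which, as far as I know, gives the same result as joining .strings by hand.

```python
from bs4 import BeautifulSoup

# Sample markup copied from the question; stands in for html.text.
html = ('<p>Example text <a href="example">Example</a> more example '
        '<a href="second example">Second example</a></p>')

soup = BeautifulSoup(html, "html.parser")
final = soup.find('p')  # assumption: first <p>, no class filter

# Join every text fragment found inside the tag, nested tags included.
joined = ''.join(final.strings)

# get_text() collects the same fragments in one call.
text = final.get_text()

print(joined)
print(text)
```

Both lines print the text with the <p> and <a> markup gone. The single-argument print(...) calls behave the same on Python 2 and Python 3.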

Answer 2

I suggest using a regular expression to match the HTML tags and replace them with an empty string.

reg = r'</?[^>]+>'. This seems to work.
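A rough sketch of how the regex approach might look in practice, using only the standard re module. The pattern `<[^>]+>` and the helper name strip_tags are my own illustrative choices, not something from the question. Keep in mind that a regex like this is fragile on real-world HTML: a `>` inside an attribute value, a comment, or a script block will trip it up, so a parser is usually the safer route.

```python
import re

# Matches anything that looks like a tag: '<', one or more non-'>' chars, '>'.
TAG_RE = re.compile(r'<[^>]+>')

def strip_tags(markup):
    """Return markup with every tag-like span removed (hypothetical helper)."""
    return TAG_RE.sub('', markup)

# Sample markup copied from the question.
html = ('<p>Example text <a href="example">Example</a> more example '
        '<a href="second example">Second example</a></p>')

result = strip_tags(html)
print(result)
```

In the original script this would be called as strip_tags(str(final)), since final is a bs4 tag rather than a plain string.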

Removing HTML tags from Python output
