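I'm trying to comment out parts of an HTML page that I may want later, rather than removing them with Beautiful Soup's tag.extract() function. For example: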
<h1> Name of Article </h1>
<p>First Paragraph I want</p>
<p>More Html I'm interested in</p>
<h2> Subheading in the article I also want </h2>
<p>Even more Html i want blah blah blah</p>
<h2> References </h2>
<p>Html I want commented out</p>
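I want everything from the References heading down, including the heading itself, commented out. Obviously, I can remove that content with Beautiful Soup's extract feature like this: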
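import re
from bs4 import BeautifulSoup

soup = BeautifulSoup(data, "lxml")
references = soup.find("h2", text=re.compile("References"))
# Remove every sibling after the heading, then the heading itself.
for elm in references.find_next_siblings():
    elm.extract()
references.extract()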
I also know that Beautiful Soup lets you create comments with its Comment class:
from bs4 import Comment
commented_tag = Comment(chunk_of_html_parsed_somewhere_else)
soup.append(commented_tag)
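That seems like a tedious way to simply wrap HTML comment markers around a specific tag, especially if the tag sits in the middle of a deep HTML tree. Isn't there an easier way to find a tag with Beautiful Soup and just place <!-- and --> directly before and after it? Thanks in advance.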
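Answer 0 (score: 1)
Assuming I understand the problem correctly, you can use replace_with() to replace a tag with a Comment instance. This is probably the simplest way to comment out an existing tag: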
import re
from bs4 import BeautifulSoup, Comment
data = """
<div>
<h1> Name of Article </h1>
<p>First Paragraph I want</p>
<p>More Html I'm interested in</p>
<h2> Subheading in the article I also want </h2>
<p>Even more Html i want blah blah blah</p>
<h2> References </h2>
<p>Html I want commented out</p>
</div>"""
soup = BeautifulSoup(data, "lxml")
elm = soup.find("h2", text=re.compile("References"))
elm.replace_with(Comment(str(elm)))
print(soup.prettify())
Prints:
<html>
 <body>
  <div>
   <h1>
    Name of Article
   </h1>
   <p>
    First Paragraph I want
   </p>
   <p>
    More Html I'm interested in
   </p>
   <h2>
    Subheading in the article I also want
   </h2>
   <p>
    Even more Html i want blah blah blah
   </p>
   <!--<h2> References </h2>-->
   <p>
    Html I want commented out
   </p>
  </div>
 </body>
</html>
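Note that the output above comments out only the References heading; the paragraph after it is still live markup. The question asked for everything from the heading down, so here is a minimal sketch of one way to extend the same replace_with() idea to the heading plus all of its following siblings. It assumes bs4 4.4 or later (for the string= argument), uses the built-in "html.parser" instead of "lxml" only to avoid an extra dependency, and uses a shortened copy of the sample markup. It also assumes the commented region contains no "--" sequences or nested comments, which would produce an invalid HTML comment.

```python
import re
from bs4 import BeautifulSoup, Comment

# Shortened copy of the sample markup from the question (an assumption
# made for brevity; any document with a "References" <h2> would do).
data = """
<div>
<h1> Name of Article </h1>
<p>First Paragraph I want</p>
<h2> References </h2>
<p>Html I want commented out</p>
</div>"""

soup = BeautifulSoup(data, "html.parser")
heading = soup.find("h2", string=re.compile("References"))

# Collect the heading plus every node after it at the same level
# (tags and whitespace text nodes alike).
nodes = [heading] + list(heading.next_siblings)

# Serialize the nodes before modifying the tree.
markup = "".join(str(node) for node in nodes)

# Drop the trailing siblings, then swap the heading for a single comment
# that holds the serialized markup of the whole region.
for node in nodes[1:]:
    node.extract()
heading.replace_with(Comment(markup))

print(soup)
```

Because the comment holds the original markup as plain text, it should be possible to bring the content back later by parsing the comment's text with BeautifulSoup again and swapping the result in with replace_with().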