Beautiful Soup: best way to comment out tags instead of extracting them?

Asked: 2016-05-15 04:24:58

Tags: python html beautifulsoup

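I'm trying to comment out parts of an HTML page that I may want later, rather than removing them with Beautiful Soup's tag.extract() method. For example: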

<h1> Name of Article </h1>
<p>First Paragraph I want</p>
<p>More Html I'm interested in</p>
<h2> Subheading in the article I also want </h2>
<p>Even more Html i want blah blah blah</p>
<h2> References </h2> 
<p>Html I want commented out</p>

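I want everything from the References heading downward, including the heading itself, to be commented out. Obviously, I can remove that content with Beautiful Soup's extract() like this: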

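import re

from bs4 import BeautifulSoup

soup = BeautifulSoup(data, "lxml")

# Remove the "References" heading and every sibling tag after it
references = soup.find("h2", text=re.compile("References"))
for elm in references.find_next_siblings():
    elm.extract()
references.extract()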

I also know that Beautiful Soup lets you create comments:

from bs4 import Comment

commented_tag = Comment(chunk_of_html_parsed_somewhere_else)
soup.append(commented_tag)

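That seems like a needlessly roundabout and tedious way to simply wrap HTML comment markers around a specific tag, especially if the tag sits in the middle of a deep HTML tree. Isn't there an easier way to find a tag with BeautifulSoup and just put <!-- --> directly before and after it? Thanks in advance.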

1 Answer:

Answer 0: (score: 1)

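Assuming I understand the problem correctly, you can use replace_with() to replace a tag with a Comment instance. This is probably the simplest way to comment out an existing tag: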

import re

from bs4 import BeautifulSoup, Comment

data = """
<div>
    <h1> Name of Article </h1>
    <p>First Paragraph I want</p>
    <p>More Html I'm interested in</p>
    <h2> Subheading in the article I also want </h2>
    <p>Even more Html i want blah blah blah</p>
    <h2> References </h2>
    <p>Html I want commented out</p>
</div>"""

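soup = BeautifulSoup(data, "lxml")
# Locate the "References" heading by its text
elm = soup.find("h2", text=re.compile("References"))
# Swap the tag for a Comment that holds the tag's own markup
elm.replace_with(Comment(str(elm)))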

print(soup.prettify())

Prints:

<html>
 <body>
  <div>
   <h1>
    Name of Article
   </h1>
   <p>
    First Paragraph I want
   </p>
   <p>
    More Html I'm interested in
   </p>
   <h2>
    Subheading in the article I also want
   </h2>
   <p>
    Even more Html i want blah blah blah
   </p>
   <!--<h2> References </h2>-->
   <p>
    Html I want commented out
   </p>
  </div>
 </body>
</html>
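
Note that in the output above only the <h2> itself ends up inside the comment; the paragraph after it is still live. If the goal is to comment out the heading together with everything that follows it, and to be able to bring it back later, something along these lines might work. This is only a sketch: comment_out_from() and restore() are hypothetical helper names (not part of Beautiful Soup), the sample HTML is shortened, and it assumes the built-in "html.parser" and the string= argument (bs4 4.4.0 or later) instead of "lxml" and text=.

```python
import re

from bs4 import BeautifulSoup, Comment


def comment_out_from(start_tag):
    """Replace start_tag and all sibling tags after it with a single Comment.

    Hypothetical helper, not a Beautiful Soup API.
    """
    doomed = [start_tag, *start_tag.find_next_siblings()]
    # Serialize the tags first, while they are still in the tree
    comment = Comment("".join(str(tag) for tag in doomed))
    # Drop the later siblings, then put the comment where the heading was
    for tag in doomed[1:]:
        tag.extract()
    start_tag.replace_with(comment)
    return comment


def restore(comment):
    """Re-parse the comment's text and put the resulting nodes back in place.

    Hypothetical helper, not a Beautiful Soup API.
    """
    fragment = BeautifulSoup(str(comment), "html.parser")
    # Copy the list: insert_before() moves each node out of the fragment
    for node in list(fragment.contents):
        comment.insert_before(node)
    comment.extract()


html = """
<div>
    <h1> Name of Article </h1>
    <p>First Paragraph I want</p>
    <h2> References </h2>
    <p>Html I want commented out</p>
    <p>Another reference</p>
</div>"""

soup = BeautifulSoup(html, "html.parser")
heading = soup.find("h2", string=re.compile("References"))

comment = comment_out_from(heading)
commented_html = str(soup)   # heading and both paragraphs sit inside <!-- -->

restore(comment)
restored_html = str(soup)    # the commented tags are live again

print(commented_html)
print(restored_html)
```

One caveat: an HTML comment cannot safely contain another comment, so if the section being commented out already has <!-- --> inside it, the inner --> will close the outer comment early when a browser or parser reads the result.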