Question

I am trying to get all the content between a closing </h2> and the next opening <h2>. Like this:

<h2> Header 1 </h2>
This is an example text for <a href="https://example.com">site</a>
Any HTML-Code can appear 
<br />
<p>
<h2> Header 2 </h2>
Some other text with no tags
<h2> Header 3 </h2>

The result should be:

This is an example text for <a href="https://example.com">site</a>
Any HTML-Code can appear 
<br>
<p>

and

Some other text with no tags

Can someone point me in the right direction?

Answer 1

I would go with decompose.

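# assumes `soup` is a BeautifulSoup object built from the HTML above
while soup.find("h2") is not None:  # find() returns the first match, or None
    soup.h2.decompose()  # remove the <h2> tag and its contents from the tree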

>>> \nThis is an example text for <a href="https://example.com">site</a>\nAny HTML-Code can appear \n<br>\n<p>\n\nSome other text with no tags\n</p></br>

Or, a bit more neatly:

soup.h2.decompose()  # drop the first <h2>; soup.h2 now refers to Header 2
second_text = soup.h2.next_sibling  # the text node right after Header 2
while soup.find("h2") is not None:
    soup.h2.decompose()

print(soup, second_text)


>>> This is an example text for <a href="https://example.com">site</a>
    Any HTML-Code can appear 
    <br>
    <p>

    Some other text with no tags
    </p></br> 
    Some other text with no tags
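
If the goal is to keep each block separately rather than print the whole document, the next_sibling idea can be extended to walk the siblings of every h2 until the next h2 is reached. This is only a minimal sketch, assuming well-formed markup in which the headers are siblings of the content between them (an unclosed <p>, as in the question's sample, may make the parser nest later headers inside it). The sample HTML and the names `sections` and `chunk` are illustrative and not from the original post.

```python
from bs4 import BeautifulSoup, Tag

# Illustrative, well-formed sample (not the exact markup from the question)
html = """<div>
<h2>Header 1</h2>
Text with a <a href="https://example.com">link</a>
<p>A closed paragraph</p>
<h2>Header 2</h2>
Some other text with no tags
<h2>Header 3</h2>
</div>"""

soup = BeautifulSoup(html, "html.parser")

sections = []
for h2 in soup.find_all("h2"):
    chunk = []
    for node in h2.next_siblings:
        # Stop as soon as the next header is reached
        if isinstance(node, Tag) and node.name == "h2":
            break
        chunk.append(str(node))  # keeps tags such as <a> and <p> as markup
    else:
        # No following <h2> was found, so this block is not bounded on both sides
        continue
    sections.append("".join(chunk).strip())

for section in sections:
    print(section)
    print("---")
```

Content after the last header is skipped on purpose here, since the question only asks for what sits between two headers.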

Answer 2

Thanks for the tip, but that is not quite what I was asking for. Perhaps I gave too little information.

There is a lot of other content before and after this text; I only want the text between </h2> and <h2>.

If I use decompose(), it only removes the h2 tags, but everything else is still there. My problem is similar to this one: Extracting text without tags of HTML with Beautifulsoup Python

I found a possible solution:

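content = soup.find_all("div", class_="class")
html = str(content)  # serialize the matched markup so it can be searched as text
marker = "Header 1</h2>"
begin = html.find(marker) + len(marker)  # start just after the closing </h2>
end = html.find("<h2>Header 2")
print(html[begin:end])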
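
The same string-based idea can be generalized so the header texts do not have to be hard-coded. The following is a rough sketch using only the standard `re` module; the function name `sections_between_h2` is made up for illustration, and it assumes the h2 elements are not nested and contain no stray `</h2>` inside them. Regular expressions are not a real HTML parser, so this is only suitable for simple, predictable markup.

```python
import re

def sections_between_h2(html):
    """Return the raw markup found between each </h2> and the next <h2>."""
    # Split on complete <h2>...</h2> elements; what remains are the gaps.
    parts = re.split(r"<h2\b[^>]*>.*?</h2>", html, flags=re.IGNORECASE | re.DOTALL)
    # parts[0] precedes the first header and parts[-1] follows the last one,
    # so only the pieces in between are bounded by headers on both sides.
    return [p.strip() for p in parts[1:-1]]

sample = """<h2> Header 1 </h2>
This is an example text for <a href="https://example.com">site</a>
Any HTML-Code can appear
<br />
<p>
<h2> Header 2 </h2>
Some other text with no tags
<h2> Header 3 </h2>
"""

for section in sections_between_h2(sample):
    print(section)
    print("---")
```

Because it works on the raw text, this version is not affected by how a parser chooses to repair the unclosed <p> in the sample.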

Getting content between tags using BeautifulSoup
