Question

I am trying to parse a very large HTML document that looks like this:

<div class="reportsubsection n">
   <h2> part 1 </h2>
   <p> insert text here </p>
   <table> crazy table thing here </table>
</div>
<div class="reportsubsection n">
   <h2> part 2 </h2>
   <p> insert text here </p>
   <table> crazy table thing here </table>
</div>

I need to parse out the second div based on the h2 that has the text "part 2". I was able to pull out the divs with:

divTag = soup.find("div", {"id": "reportsubsection"})

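But I don't know how to narrow it down from there. Other posts I found let me locate the specific text "part 2", but I need to output the entire div section that contains it.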

EDIT / UPDATE

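OK, sorry, but I'm still a bit lost. This is what I have now. I feel like this should be much simpler than I'm making it. Thanks again for all the help.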

divTag = soup.find("div", {"id": "reportsubsection"})
for reportsubsection in soup.select('div#reportsubsection #reportsubsection'):
    if not reportsubsection.findAll('h2', text=re.compile('Finding')):
        continue
print divTag

Answer 1

Once you have found the right h2, you can always navigate back up, or you can test all the subsections:

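import re

# Note: this selector assumes id="reportsubsection" / id="subsection" attributes;
# for the class-based markup shown in the question, 'div.reportsubsection' would match instead.
for subsection in soup.select('div#reportsubsection #subsection'):
    if not subsection.find('h2', text=re.compile('part 2')):
        continue
    # do something with this subsection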

This uses a CSS selector to find all the subsections.
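As a minimal sketch of the selector approach, adapted to the markup in the question: the sample HTML below is a cut-down copy of that markup, the helper name `find_sections` is hypothetical, and the selector `div.reportsubsection` is an assumption based on the divs carrying a `class` rather than an `id`. It uses `string=`, the newer spelling of the `text=` argument in Beautiful Soup 4.

```python
import re
from bs4 import BeautifulSoup

# Sample markup modeled on the question (class attribute, not id).
html = """
<div class="reportsubsection n">
   <h2> part 1 </h2>
   <p> insert text here </p>
   <table><tr><td>table 1</td></tr></table>
</div>
<div class="reportsubsection n">
   <h2> part 2 </h2>
   <p> insert text here </p>
   <table><tr><td>table 2</td></tr></table>
</div>
"""

soup = BeautifulSoup(html, "html.parser")


def find_sections(soup, pattern):
    """Return every div.reportsubsection that contains an h2 matching pattern."""
    matches = []
    for section in soup.select("div.reportsubsection"):
        # find() returns None when no h2 inside this div matches the regex
        if section.find("h2", string=re.compile(pattern)):
            matches.append(section)
    return matches


sections = find_sections(soup, r"part 2")
for section in sections:
    print(section)  # prints the whole matching div, tags included
```

Printing the tag (or calling `str()` on it) gives the full div, which is what the question asks for.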

Or, go back up using the .parent attribute:

for header in soup.find_all('h2', text=re.compile('part 2')):
    section = header.parent

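The trick is to narrow your search as early as possible; the second option has to find every h2 element in the whole document, while the first narrows the search down sooner.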
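The header-first approach can be sketched the same way. This is only an illustration under assumptions: the sample HTML is again modeled on the question, the helper name `sections_by_heading` is hypothetical, and `find_parent` is used in place of `.parent` so the lookup still works if the h2 is not a direct child of the div.

```python
import re
from bs4 import BeautifulSoup

# Sample markup modeled on the question.
html = """
<div class="reportsubsection n">
   <h2> part 1 </h2>
   <p> first paragraph </p>
</div>
<div class="reportsubsection n">
   <h2> part 2 </h2>
   <p> second paragraph </p>
</div>
"""

soup = BeautifulSoup(html, "html.parser")


def sections_by_heading(soup, pattern):
    """Find matching h2 elements first, then climb to the enclosing div."""
    found = []
    for header in soup.find_all("h2", string=re.compile(pattern)):
        # class_ matches any one of the space-separated class values,
        # so "reportsubsection n" is matched by "reportsubsection"
        section = header.find_parent("div", class_="reportsubsection")
        if section is not None:
            found.append(section)
    return found


found = sections_by_heading(soup, r"part 2")
for section in found:
    print(section)
```

This variant scans every h2 in the document, so on a very large page the selector-based loop may be the better first choice.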

Python / Beautiful Soup: find a particular heading and output the full div
