Question

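I'm trying to get all the posts from a forum thread. Everything works for most posts, but whenever a post is a reply that quotes the original message, I can't get the reply text. I found that soup.findAll(...) does not return all the descendants present in the HTML source (see the pictures below). From the HTML in picture 2 I get: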

<b>Citation :</b><br/>

when I should get the following (in the 'p' tag: si-si je suis là...). In fact, I only want what is in this 'p' tag.

import requests
from bs4 import BeautifulSoup as bsoup

site_source = requests.get("http://forum.doctissimo.fr/sante/audition/acouphene-mouvement-secouant-sujet_152572_1.htm").content
soup = bsoup(site_source, "html.parser")

# Get text from forum posts

post_boxes = soup.findAll("td", class_="messCase2", style="border-bottom:0")

for post_box in post_boxes:
    message = post_box.find("div", itemprop="text")
    for line in message:
        print(line)

Picture: new post (the parsing works)

Picture: reply post

Thanks for your help.

Answer 1

Use .descendants to recursively visit all of the nested children of a Tag object.
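As a minimal sketch of the difference, the snippet below compares plain iteration over a Tag with .descendants. The markup is a made-up, well-formed fragment (not taken from the forum page), and the id "post" is only an illustrative name.

```python
from bs4 import BeautifulSoup

# Hypothetical, well-formed snippet used only for illustration.
html = '<div id="post"><p>outer <b>bold</b></p><br/></div>'
soup = BeautifulSoup(html, "html.parser")
div = soup.find("div", id="post")

# Iterating over a Tag yields only its direct children.
children = [child.name for child in div]

# .descendants walks the whole subtree in document order.
# Text nodes have no tag name, so their .name is None.
descendants = [node.name for node in div.descendants]

print(children)     # ['p', 'br']
print(descendants)  # ['p', None, 'b', None, 'br']
```

Note that .descendants only helps when the content really is inside the tag in the parsed tree.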

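However, there is a bigger problem. The link you provided does not serve the same HTML structure as your "reply post" screenshot. On the actual page, the additional <p> tags that contain the content you want are not part of the <div> with id para77714. Here is a screenshot of Chrome's Inspect tab for this page, showing the content you are interested in: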

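The <div> does not appear to be closed properly, but it is nested inside a <p> tag that closes before the next <div class="container">. (This is a good reason not to rely on screenshots when posting a question.)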

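Your browser has probably normalized the dangling </p> into a <p></p> pair, but BeautifulSoup respects the tag closings as they appear in the source, which is why the target content does not show up among the <div>'s children.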
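The effect can be reproduced on a small scale. This is only a sketch on hypothetical markup that imitates the situation described above (a </p> arriving while a <div> is still open). It is not the forum's actual HTML, and the id "box" is invented for the example. It assumes the built-in html.parser backend; other parsers may repair the markup differently.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the </p> arrives while the <div> is still open,
# and the reply text only comes after it.
html = '<p><div id="box"><b>Citation :</b><br/></p>si-si je suis là</div>'
soup = BeautifulSoup(html, "html.parser")

div = soup.find("div", id="box")
inside = div.get_text()

# Text that the parser placed inside the <div>.
print(repr(inside))
# Whether the reply text ended up among the <div>'s descendants.
print("si-si" in inside)
# Whether the reply text is still present somewhere in the document.
print("si-si" in soup.get_text())
```

With html.parser, the stray </p> also closes the still-open <div>, so the reply text is parsed as a node after the <div> rather than inside it.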

Here is what the <div> actually contains:

import requests
from bs4 import BeautifulSoup as bsoup

site_source = requests.get("http://forum.doctissimo.fr/sante/audition/acouphene-mouvement-secouant-sujet_152572_1.htm").content
soup = bsoup(site_source, "html.parser")

div = soup.find_all("div", id="para77714")

for tag in div:
    for subtag in tag.descendants:
        print(subtag)

Output:

<div itemprop="text"><b>Citation :</b><br/></div>
<b>Citation :</b>
Citation :
<br/>

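The main problem is that the HTML on this page is a mess. BeautifulSoup is entirely justified in not being able to make heads or tails of it. For example, it looks as though your target content is wrapped in a <p> tag, but because everything before it has been thrown off by incorrect closing tags, there is no valid tag actually wrapping your content.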

You can see this if you walk forward through the tag hierarchy from the starting <div>:

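# div is the ResultSet returned by find_all() above, so start from its first element
start = div[0]

# Look only at the 13th and 14th tags that follow the starting <div>
for ct, tag in enumerate(start.find_all_next(), start=1):
    if 12 < ct < 15:
        for subtag in tag:
            print(f"tag name: {subtag.name}")
            print(f"{subtag.string}\n")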

tag name: None
Afficher moins

tag name: br
None

tag name: br
None

tag name: br
None

tag name: None
si-si je suis là   <--- this is what you want, but it is not in a valid <p> tag

tag name: img
None
# ...
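Since the reply text sits after the <div> rather than inside it, one possible workaround is to walk forward from the <div> and collect the loose strings until the next post container starts. The sketch below runs on hypothetical markup modelled on the description above, not on the live page. The helper name text_after, the id "para1", and the choice of "container" as the stop class are assumptions for illustration.

```python
from bs4 import BeautifulSoup, Tag

# Hypothetical markup imitating the broken structure described above.
html = (
    '<div class="container">'
    '<p><div id="para1" itemprop="text"><b>Citation :</b><br/></p>'
    'si-si je suis là<img src="smiley.gif"/>'
    '</div>'
    '<div class="container">next post</div>'
)
soup = BeautifulSoup(html, "html.parser")


def text_after(start, stop_class="container"):
    """Collect strings that follow `start` in document order but lie outside it."""
    pieces = []
    for node in start.next_elements:
        if isinstance(node, Tag):
            # Stop once the next post container begins.
            if stop_class in node.get("class", []):
                break
            continue
        # Skip strings that are still inside the starting tag.
        if any(parent is start for parent in node.parents):
            continue
        text = node.strip()
        if text:
            pieces.append(text)
    return pieces


start = soup.find("div", id="para1")
result = text_after(start)
print(result)
```

On the real page the stop condition would need to match whatever element actually begins the next post, so treat the class name here as a placeholder.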

TL;DR: BeautifulSoup is working fine; this web page is broken.

BeautifulSoup - findAll() does not return all descendants
