Question

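I am trying to extract some information from a website using the Python BeautifulSoup library. In particular, I want to extract the value from the following HTML: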

<span class="g47SY ">68</span>

Using find_all does not work, and I don't understand the error. Can you help me?

Here is my code:

import requests
from bs4 import BeautifulSoup

url = 'https://www.exemple.com/'
r = requests.get(url)
html_as_string = r.text
soup = BeautifulSoup(html_as_string, 'html.parser')

# print(soup.prettify())

# I want to extract 68 from <span class="g47SY ">68</span>
info = soup.find_all("span", class_="g47SY")
print(info)

Answer 1

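Your code is correct as far as finding elements in an HTML page goes. The problem is the Instagram page itself. If you look at its source (rather than the DevTools Elements panel), you will see that it is almost empty. Instagram is built entirely with JavaScript (an anti-pattern, but a deeply entrenched one), so the element you are looking for exists only on the client, after the JavaScript has run. requests does not execute JavaScript, so r.text never contains that span.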
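
One way to confirm this diagnosis is to count the matching elements in the raw response before blaming the selector. The sketch below uses two invented HTML strings as stand-ins for r.text (a JavaScript shell and a server-rendered page); count_spans is a hypothetical helper, not part of BeautifulSoup.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-ins for r.text; neither is the real page.
js_shell_html = '<html><body><div id="root"></div><script src="app.js"></script></body></html>'
rendered_html = '<html><body><span class="g47SY ">68</span></body></html>'


def count_spans(html, class_name):
    """Return how many <span> elements carrying class_name appear in html."""
    soup = BeautifulSoup(html, "html.parser")
    return len(soup.find_all("span", class_=class_name))


shell_count = count_spans(js_shell_html, "g47SY")      # markup as downloaded, before any JavaScript
rendered_count = count_spans(rendered_html, "g47SY")   # markup that already contains the span
print(shell_count, rendered_count)
```

If the count for the real r.text is 0, the selector is not at fault; the element was never in the downloaded HTML.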

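You can use Selenium for this. It essentially opens the site in a real browser and can do everything a normal browser does, including running the page's JavaScript. For example: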

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# initialization
driver = webdriver.Firefox()
driver.get("https://www.instagram.com/antedoro/")

try:
    # wait up to 10 seconds for the parent of the spans to be present
    element = WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.CLASS_NAME, "Y8-fY")))
    # locate the spans
    spans = driver.find_elements(By.CSS_SELECTOR, "span.g47SY")
    text_of_spans = [span.text for span in spans]
finally:
    driver.quit()  # end the whole browser session, not just the current window
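
Once the browser has rendered the page, the resulting HTML can also be handed back to BeautifulSoup, so the original find_all logic is reused unchanged. This is a minimal sketch, assuming the rendered markup is available as a string (with Selenium that would be driver.page_source); span_texts is a hypothetical helper, and the sample string is invented so the block runs without a browser.

```python
from bs4 import BeautifulSoup


def span_texts(rendered_html, class_name="g47SY"):
    """Return the stripped text of every <span> with class_name in rendered_html."""
    soup = BeautifulSoup(rendered_html, "html.parser")
    return [span.get_text(strip=True) for span in soup.find_all("span", class_=class_name)]


# Invented sample; in practice pass driver.page_source after the wait succeeds.
sample = (
    '<ul class="Y8-fY">'
    '<li><span class="g47SY ">68</span></li>'
    '<li><span class="g47SY ">120</span></li>'
    '</ul>'
)
texts = span_texts(sample)
print(texts)
```

Read driver.page_source inside the try block, before the driver is shut down in finally.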

Answer 2

find_all returns a list, so you need to select the first item and then use its text attribute. Like this:

# I want to extract 68 from <span class="g47SY ">68</span>
info = soup.find_all("span", class_="g47SY")
print(info[0].text)

(Why the downvote? I just tested this and it works in bs4.)
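
Note that info[0] raises IndexError when find_all finds nothing, which is what happens if the span is missing from the downloaded HTML. A defensive sketch, assuming a hypothetical helper first_span_text and invented sample strings:

```python
from bs4 import BeautifulSoup


def first_span_text(html, class_name="g47SY"):
    """Return the text of the first matching <span>, or None if there is no match."""
    soup = BeautifulSoup(html, "html.parser")
    span = soup.find("span", class_=class_name)  # find returns None when nothing matches
    return span.get_text(strip=True) if span is not None else None


found = first_span_text('<span class="g47SY ">68</span>')
missing = first_span_text('<div id="root"></div>')
print(found, missing)
```

Checking for None keeps the script from crashing and makes the "element not in the HTML" case explicit.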
