How to extract names and links from a given website - python

Date: 2021-03-19 19:13:49

Tags: python-3.x selenium-webdriver beautifulsoup

For the website mentioned below, I am trying to extract the names and their corresponding links. However, I am not able to retrieve any data at all.

Using BeautifulSoup

from bs4 import BeautifulSoup
import requests

source = requests.get('https://mommypoppins.com/events/115/los-angeles/all/tag/all/age/all/all/deals/0/near/0/0')

soup = BeautifulSoup(source.text, 'html.parser')
mains = soup.find_all("div", {"class": "list-container-wrapper"})

name = []
lnks = []

for main in mains:
    name.append(main.find("a").text)
    lnks.append(main.find("a").get('href'))

Using Selenium WebDriver

from selenium import webdriver
driver = webdriver.Chrome(executable_path=r"chromedriver_win32\chromedriver.exe")
driver.get("https://mommypoppins.com/events/115/los-angeles/all/tag/all/age/all/all/deals/0/near/0/0")

lnks = []
name = []

for a in driver.find_elements_by_class_name('ng-star-inserted'):
    link = a.get_attribute('href')
    lnks.append(link)
    
    nm = driver.find_element_by_css_selector("#list-item-0 > div > h2 > a").text
    name.append(nm)

I have tried both of the methods above.

Example of the expected output:

name = ['Friday Night Flicks Drive-In at the Roadium', 'Open: Butterfly Pavilion and Nature Gardens']
lnks = ['https://mommypoppins.com/los-angeles-kids/event/in-person/friday-night-flicks-drive-in-at-the-roadium','https://mommypoppins.com/los-angeles-kids/event/in-person/open-butterfly-pavilion-and-nature-gardens']

2 Answers:

Answer 0: (score: 1)

Here is a solution using webdriver:

import time

from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()

driver.get('https://mommypoppins.com/events/115/los-angeles/all/tag/all/age/all/all/deals/0/near/0/0')

# give the Angular front end a moment to render the event list
time.sleep(3)

# each event link carries this analytics attribute, which makes a convenient selector
elements = driver.find_elements(By.XPATH, "//a[@angularticsaction='expanded-detail']")

# map each event name to its detail-page URL
attributes = [{el.text: el.get_attribute('href')} for el in elements]

print(attributes)
print(len(attributes))

driver.quit()
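
If the fixed time.sleep(3) feels fragile, an explicit wait is a common alternative. Here is a minimal sketch of the same scrape using Selenium's WebDriverWait (same XPath as above; the 10-second timeout is an assumption, tune it as needed):

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
driver.get('https://mommypoppins.com/events/115/los-angeles/all/tag/all/age/all/all/deals/0/near/0/0')

# block until at least one event link has been rendered (assumed 10 s timeout)
elements = WebDriverWait(driver, 10).until(
    EC.presence_of_all_elements_located(
        (By.XPATH, "//a[@angularticsaction='expanded-detail']")
    )
)

attributes = [{el.text: el.get_attribute('href')} for el in elements]
print(attributes)

driver.quit()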

Here is a solution using webdriver together with bs4:

import time

from selenium import webdriver
from bs4 import BeautifulSoup

driver = webdriver.Chrome()
driver.get('https://mommypoppins.com/events/115/los-angeles/all/tag/all/age/all/all/deals/0/near/0/0')
time.sleep(3)

# hand the rendered page over to BeautifulSoup
soup = BeautifulSoup(driver.page_source, 'html.parser')
mains = soup.find_all("a", {"angularticsaction": "expanded-detail"})

# map each event name to its detail-page URL
attributes = [{el.text: el.get('href')} for el in mains]

print(attributes)
print(len(attributes))

driver.quit()
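
If you want the two separate lists from the question (name and lnks) instead of a list of dicts, the same mains result can be split, e.g.:

name = [a.text for a in mains]
lnks = [a.get('href') for a in mains]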

Here is a solution using requests:

import requests

url = "https://mommypoppins.com"
response = requests.get(f"{url}/contentasjson/custom_data/events_ng-block_1x/0/115/all/all/all/all/all").json()


attributes = [{r.get('node_title'): f"{url}{r['node'][r['nid']]['node_url']}"} for r in response['results']]

print(attributes)
print(len(attributes))

Cheers!

Answer 1: (score: 1)

The website is loaded dynamically, so requests alone will not see the data in the rendered HTML. However, the data can be retrieved in JSON format by sending a GET request to:

https://mommypoppins.com/contentasjson/custom_data/events_ng-block_1x/0/115/all/all/all/all/all

There is no need for BeautifulSoup or Selenium; requests alone is enough, and it will also make your code much faster.

import requests

URL = "https://mommypoppins.com/contentasjson/custom_data/events_ng-block_1x/0/115/all/all/all/all/all"
BASE_URL = "https://mommypoppins.com"
response = requests.get(URL).json()

names = []
links = []

for json_data in response["results"]:
    # each result nests the event node under its numeric node id
    data = json_data["node"][json_data["nid"]]
    names.append(data["title"])
    links.append(BASE_URL + data["node_url"])
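
To check the result in the format shown in the question, the two lists can be paired up and printed, e.g.:

for name, link in zip(names, links):
    print(name, link)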