How to scrape data from this site using BeautifulSoup

Time: 2020-04-09 12:06:35

Tags: python python-3.x web-scraping beautifulsoup

import requests
import bs4
html_page = requests.get(
    'https://homeshopping.pk/categories/Mobile-Phones-Price-Pakistan')
html_page.raise_for_status()
soup = bs4.BeautifulSoup(html_page.text, features='lxml')
h = soup.find('div','ProductList')
print(h)

This code returns an empty object, though. How can I get the product prices from this link?

1 answer:

Answer 0: (score: 0)

The prices are in div elements with the class "ActualPrice". To get all such div elements, you can use:

soup.find_all('div', class_='ActualPrice')
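
As a minimal sketch of what that call returns, the snippet below runs it against a small hand-written HTML fragment. The fragment is made up for illustration and only mimics the class names mentioned in this answer; it is not the site's real markup. `html.parser` is used here so that the example does not depend on lxml.

```python
import bs4

# Hypothetical markup: only the class names are taken from the answer.
SAMPLE_HTML = """
<div class="product-box">
  <h5 class="ProductDetails">Phone A</h5>
  <div class="ActualPrice">Rs 1,000</div>
</div>
<div class="product-box">
  <h5 class="ProductDetails">Phone B</h5>
  <div class="ActualPrice">Rs 2,500</div>
</div>
"""

soup = bs4.BeautifulSoup(SAMPLE_HTML, features='html.parser')

# find_all returns a list-like ResultSet of every matching tag.
price_divs = soup.find_all('div', class_='ActualPrice')

# get_text(strip=True) drops the surrounding whitespace of each tag's text.
prices = [div.get_text(strip=True) for div in price_divs]
print(prices)
```

If no tag matches, `find_all` returns an empty list, whereas `find` returns `None`, which is what the code in the question printed.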

To get the prices together with the product details, you can do the following:

import requests
import bs4
html_page = requests.get(
    'https://homeshopping.pk/categories/Mobile-Phones-Price-Pakistan')
html_page.raise_for_status()
soup = bs4.BeautifulSoup(html_page.text, features='lxml')
products = soup.find_all('div', class_='product-box')
for product in products[:3]: #for the first 3 products
    product_name = product.find('h5', class_='ProductDetails')
    print(product_name.text)
    product_price = product.find('div', class_='ActualPrice')
    print(product_price.text)

#Output
Apple iPhone XS (4G, 64GB, Gold) - PTA Approved.
Rs 131,999
Apple iPhone XS Max (4G, 256GB Gold) - PTA Approved
Rs 154,999
Oppo A5 2020 Dual Sim (4G, 4GB RAM, 128Gb ROM, Mirror Black) With 1 Year Official Warranty 
Rs 31,599
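
If the prices are needed as numbers rather than strings, something like the following could be used. This is only a sketch: `parse_price` is a hypothetical helper, and it assumes every price looks like the output above ("Rs" followed by digits with comma separators).

```python
import re


def parse_price(text):
    """Return the integer amount in a string such as 'Rs 131,999', or None."""
    # Assumption: the amount is the first run of digits and commas in the text.
    match = re.search(r'\d[\d,]*', text)
    if match is None:
        return None
    return int(match.group().replace(',', ''))


print(parse_price('Rs 131,999'))
print(parse_price('Call for price'))
```

Returning `None` for text without digits keeps the loop from crashing on a product that has no listed price.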

When you scroll down the page, JavaScript generates a URL request like this: https://homeshopping.pk/categories/Mobile-Phones-Price-Pakistan?page=1&AjaxRequest=1
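
Instead of formatting the query string by hand, the same URL can be built from a dictionary of parameters. The sketch below only prepares the request to show the resulting URL and does not send anything; the names `BASE_URL` and `build_page_url` are introduced here for illustration.

```python
import requests

BASE_URL = 'https://homeshopping.pk/categories/Mobile-Phones-Price-Pakistan'


def build_page_url(page_number):
    """Return the AJAX URL for one page without sending a request."""
    params = {'page': page_number, 'AjaxRequest': 1}
    # Request(...).prepare() encodes the params into the URL.
    prepared = requests.Request('GET', BASE_URL, params=params).prepare()
    return prepared.url


print(build_page_url(1))
```

When actually fetching, the same dictionary can be passed directly as `requests.get(BASE_URL, params=params)`.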

To get all the phones in that category, you can loop over the page numbers until a request returns no products:

import requests
import bs4
page_number = 1
more_products = True
while more_products:
    html_page = requests.get(
        'https://homeshopping.pk/categories/Mobile-Phones-Price-Pakistan?page={}&AjaxRequest=1'.format(page_number))
    html_page.raise_for_status()
    soup = bs4.BeautifulSoup(html_page.text, features='lxml')
    products = soup.find_all('div', class_='product-box')
    if not products:
        more_products = False
    for product in products[:1]:  # only the first product of each page; iterate over `products` to print them all
        product_name = product.find('h5', class_='ProductDetails')
        print(product_name.text)
        product_price = product.find('div', class_='ActualPrice')
        print(product_price.text)
    page_number += 1
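
The loop above prints as it goes. A variant that collects the results into a list is sketched below. The helpers `parse_products` and `collect_products`, the `max_pages` safety cap, and the fake pages are all assumptions made for this example; the fetching step is passed in as a function so that the logic can be tried without network access. It also stops on the first page that yields no parsed products, and skips boxes that lack a name or a price.

```python
import bs4


def parse_products(html):
    """Return (name, price) tuples for every product-box in the HTML."""
    soup = bs4.BeautifulSoup(html, features='html.parser')
    items = []
    for box in soup.find_all('div', class_='product-box'):
        name = box.find('h5', class_='ProductDetails')
        price = box.find('div', class_='ActualPrice')
        if name is None or price is None:
            continue  # skip boxes that lack either field
        items.append((name.get_text(strip=True), price.get_text(strip=True)))
    return items


def collect_products(fetch_page, max_pages=50):
    """Call fetch_page(1), fetch_page(2), ... until a page has no products."""
    all_items = []
    for page_number in range(1, max_pages + 1):
        items = parse_products(fetch_page(page_number))
        if not items:
            break
        all_items.extend(items)
    return all_items


# Made-up pages standing in for the site's AJAX responses.
BOX = ('<div class="product-box"><h5 class="ProductDetails">{}</h5>'
       '<div class="ActualPrice">{}</div></div>')
FAKE_PAGES = {
    1: BOX.format('Phone A', 'Rs 1,000') + BOX.format('Phone B', 'Rs 2,500'),
    2: BOX.format('Phone C', 'Rs 3,000'),
}


def fake_fetch(page_number):
    return FAKE_PAGES.get(page_number, '')


result = collect_products(fake_fetch)
print(result)
```

For the real site, `fetch_page` could be a small function that calls `requests.get` on the AJAX URL shown earlier, calls `raise_for_status()`, and returns the response's `.text`.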