Scraping products from a Shopify site - unexpected results

Date: 2021-01-07 14:49:40

Tags: python html web-scraping python-requests shopify

So I'm generally new to coding, but for my first project I'm trying to build a monitor that watches a Shopify site for product changes.

My approach was to take publicly shared code from online and work backwards from there to understand it, so inside a broader class I ended up with the code below, which seems to fetch products.json by looping through the pages.

But when I load https://www.hanon-shop.com/collections/all/products.json in the browser and then print my list of items below, the first few products are different. How does that make sense?

import json
import logging
import time

import requests as rq


def scrape_site(self):
    """
    Scrapes the specified Shopify site and adds items to self.items
    :return: None
    """
    self.items = []
    s = rq.Session()
    page = 1
    while page > 0:
        try:
            html = s.get(self.url + '?page=' + str(page) + '&limit=250',
                         headers=self.headers, proxies=self.proxy,
                         verify=False, timeout=20)
            output = json.loads(html.text)['products']
            if output == []:
                # An empty products list means we have run out of pages
                page = 0
            else:
                for product in output:
                    product_item = [{'title': product['title'],
                                     'image': product['images'][0]['src'],
                                     'handle': product['handle'],
                                     'variants': product['variants']}]
                    self.items.append(product_item)
                logging.info(msg='Successfully scraped site')
                page += 1
        except Exception as e:
            logging.error(e)
            page = 0
        time.sleep(0.5)
    s.close()
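
For reference, here is a stripped-down standalone version of the same request (no session, proxies, or custom headers, so those parts are left out on purpose) that prints the first few titles from page 1. Its output can be compared directly against what the browser shows for the same URL:

import requests

# Fetch page 1 of the public products.json endpoint and print the
# first few product titles for a quick eyeball comparison.
url = 'https://www.hanon-shop.com/collections/all/products.json'
resp = requests.get(url, params={'page': 1, 'limit': 250}, timeout=20)
resp.raise_for_status()

for product in resp.json()['products'][:5]:
    print(product['title'], '-', product['handle'])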

1 Answer:

Answer 0 (score: 0)

requests accepts the query string as a dict (the params argument), and the response object has a .json() method, so this can be written more cleanly:

import logging
import time

import requests


def scrape_site(self):
    self.items = []
    page = 1

    with requests.Session() as s:
        while True:
            params = {
                'page': page,
                'limit': 250
            }

            try:
                r = s.get(self.url, params=params, headers=self.headers,
                          proxies=self.proxy, verify=False, timeout=20)
                r.raise_for_status()
                # Check the products list itself: the JSON body is a dict
                # ({'products': [...]}), which is truthy even when the
                # list inside it is empty
                products = r.json().get('products', [])
                if not products:
                    break
                for product in products:
                    product_item = {
                        'title': product['title'],
                        'image': product['images'][0]['src'],
                        'handle': product['handle'],
                        'variants': product['variants']
                    }
                    self.items.append(product_item)
                logging.info(f'Successfully scraped page {page}')
                page += 1
                time.sleep(1)

            except Exception as e:
                logging.error(e)
                break

    return self.items
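
For completeness, here is a rough sketch of how this method could be wired into a class; the ShopifyScraper name and its default headers/proxy values are placeholders for illustration, not part of the original code:

import logging

logging.basicConfig(level=logging.INFO)


class ShopifyScraper:
    # Hypothetical wrapper: its attributes mirror what scrape_site
    # expects (url, headers, proxy, items).
    def __init__(self, url):
        self.url = url
        self.headers = {'User-Agent': 'Mozilla/5.0'}
        self.proxy = {}  # empty dict means requests uses no proxy
        self.items = []

    scrape_site = scrape_site  # bind the function above as a method


scraper = ShopifyScraper('https://www.hanon-shop.com/collections/all/products.json')
items = scraper.scrape_site()
print(f'Scraped {len(items)} products')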