Is it possible to speed up a web scraper by using multiple computers?

Time: 2018-06-01 21:05:29

Tags: python python-3.x web-scraping python-requests grequests

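Is there a way to speed up a web scraper by having multiple computers share the work of processing a list of URLs? For example, computer A takes URLs 1 to 500, computer B takes URLs 501 to 1000, and so on. I am looking for a way to build the fastest possible web scraper using resources available to everyday people.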
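The split described above can be sketched as a small helper that each computer runs with its own index. This is only a sketch under assumptions: `shard_for_worker`, `worker_index`, and `worker_count` are hypothetical names, the `example.com` URLs are placeholders, and every machine is assumed to start from the same ordered list.

```python
from typing import List, Sequence


def shard_for_worker(urls: Sequence[str], worker_index: int, worker_count: int) -> List[str]:
    """Return the contiguous slice of `urls` that one worker should fetch."""
    if worker_count < 1:
        raise ValueError("worker_count must be at least 1")
    if not 0 <= worker_index < worker_count:
        raise ValueError("worker_index must be in range(worker_count)")
    # Ceiling division so every URL is covered; the last worker may get a shorter slice.
    chunk_size = -(-len(urls) // worker_count)
    start = worker_index * chunk_size
    return list(urls[start:start + chunk_size])


# Hypothetical example: 1000 placeholder URLs split across two computers.
example_urls = ["https://example.com/item/{}".format(n) for n in range(1, 1001)]
computer_a = shard_for_worker(example_urls, 0, 2)  # URLs 1 to 500
computer_b = shard_for_worker(example_urls, 1, 2)  # URLs 501 to 1000
```

Each machine would then pass only its own slice to the scraping loop. How the results are collected back in one place is left open here.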

I am already making concurrent requests with the grequests module, which is a combination of gevent + requests.

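This scrape does not need to run constantly. It needs to start at a specific time each morning (6 A.M.) and finish soon after it starts. I am looking for something quick and punctual.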
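For the 6 A.M. start, one possible sketch is to compute how long to wait until the next occurrence of that hour. `seconds_until` is a hypothetical helper, and it assumes naive local datetimes with no daylight-saving handling. An operating-system scheduler such as cron or Task Scheduler would be the more usual way to do this.

```python
import datetime


def seconds_until(hour: int, now: datetime.datetime) -> float:
    """Seconds from `now` until the next occurrence of `hour`:00 (naive local time)."""
    target = now.replace(hour=hour, minute=0, second=0, microsecond=0)
    if target <= now:
        # Today's slot has already passed, so aim for tomorrow's.
        target += datetime.timedelta(days=1)
    return (target - now).total_seconds()


# Example with a fixed timestamp: one hour before the 6 A.M. start.
wait = seconds_until(6, datetime.datetime(2018, 6, 2, 5, 0, 0))
```

A long-running script could call `time.sleep(seconds_until(6, datetime.datetime.now()))` before starting the scrape.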

Also, I am looking through URLs for retail stores (i.e. target, bestbuy, newegg, etc.) and using them to check which items are in stock that day.

Here is a code segment for fetching those URLs, from the script I am trying to put together:

import datetime
import time

import grequests

thread_number = 20  # number of URLs requested concurrently per batch

# Product number list is a list of product numbers, too big for me to include the full list. Here are like three:
product_number_list = ['N82E16820232476', 'N82E16820233852', 'N82E16820313777']
base_url = 'https://www.newegg.com/Product/Product.aspx?Item={}'

# Build one URL per product number.
url_list = []
for number in product_number_list:
    url_list.append(base_url.format(number))

# Progress is reported every nnn successful batches; max() avoids a zero modulus for short lists.
nnn = max(1, len(product_number_list) // 100)
float_nnn = len(product_number_list) / 100

results = []
appended_number = 0
for x in range(0, len(url_list), thread_number):
    attempts = 0
    while attempts < 10:
        try:
            rs = (grequests.get(url, stream=False) for url in url_list[x:x+thread_number])
            reqs = grequests.map(rs, stream=False, size=20)
            append = 'yes'
            for i in reqs:
                # grequests.map returns None for a request that failed, so check for that first.
                if i is None or i.status_code != 200:
                    append = 'no'
                    print('Bad Status Code. Nothing Appended.')
                    attempts += 1
                    break
            if append == 'yes':
                appended_number += 1
                results.extend(reqs)
                break
        except Exception:
            print('Something went Wrong. Try Section Failed.')
            attempts += 1
            time.sleep(5)
    if appended_number % nnn == 0:
        now = datetime.datetime.today()
        # The last batch may be partial, so cap the reported percentage at 100.
        percent = min(100, int(thread_number * appended_number / float_nnn))
        print(str(percent) + '% of the way there at: ' + now.strftime("%I:%M:%S %p"))
    if attempts == 10:
        print('Failed ten times to get urls.')
        time.sleep(3600)
if len(results) != len(url_list):
    print('Results count is off. len(results) == "' + str(len(results)) + '". len(url_list) == "' + str(len(url_list)) + '".')

This is not my own code; it comes from these two links:

Using grequests to make several thousand get requests to sourceforge, get "Max retries exceeded with url"

Understanding requests versus grequests

0 answers:

No answers