Multithreaded Python requests

Date: 2016-07-30 08:08:26

Tags: python python-requests

For my bachelor thesis I need to scrape data from about 40,000 websites. I am using python-requests, but at the moment getting a response from the servers is really slow.

Is there any way to speed it up while keeping my current header settings? All the tutorials I found work without headers.

Here is my code snippet:

import requests

def parse(url):
    headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) '
                             'Chrome/39.0.2171.95 Safari/537.36'}
    r = requests.get(url, headers=headers)

    for line in r.iter_lines():
        ...

3 Answers:

Answer 0 (score: 1)

You have to use threads, since this is an I/O-bound problem. Using the built-in threading library is your best choice. I use a Semaphore object to limit how many threads are running at once.

import time
import threading

# Number of parallel threads
lock = threading.Semaphore(2)


def parse(url):
    """
    Change this to your logic; I just use sleep to mock an HTTP request.
    """
    print('getting info', url)
    time.sleep(2)

    # After we are done, release one slot in the semaphore
    lock.release()


def parse_pool():
    # List of all your urls
    list_of_urls = ['website1', 'website2', 'website3', 'website4']

    # List of thread objects, so we can join them later
    thread_pool = []

    for url in list_of_urls:
        # Create a new thread that calls your function with a url
        thread = threading.Thread(target=parse, args=(url,))
        thread_pool.append(thread)
        thread.start()

        # Take one slot from the semaphore, so we block when all slots are in use
        lock.acquire()

    for thread in thread_pool:
        thread.join()

    print('done')
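As a variation on the Semaphore pattern above, the standard library's `concurrent.futures.ThreadPoolExecutor` manages the worker count for you. This is a minimal sketch; the `fetch` function is a mock standing in for the real `requests.get` call, just like the `sleep` above:

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fetch(url):
    # Stand-in for the real requests.get call with your headers
    time.sleep(0.1)
    return 'data from {}'.format(url)


urls = ['website1', 'website2', 'website3', 'website4']

# max_workers caps concurrency, playing the role of Semaphore(2) above
with ThreadPoolExecutor(max_workers=2) as pool:
    results = list(pool.map(fetch, urls))

print(results)
```

`pool.map` returns results in input order and the `with` block waits for all workers, so there is no manual acquire/release or join bookkeeping.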

Answer 1 (score: 0)

You can use asyncio to run tasks concurrently. You can list the url responses (completed and pending) using the return value of asyncio.wait() and call the coroutines asynchronously. The results will come back in an unpredictable order, but it is a faster approach.

import asyncio


async def parse(url):
    print('in parse for url {}'.format(url))

    # Write the logic for fetching the info here; it awaits the response
    # from the url. As a placeholder, sleep briefly and return a dummy value:
    await asyncio.sleep(0.1)
    info = 'info'

    print('done with url {}'.format(url))
    return 'parse {} result from {}'.format(info, url)


async def main(sites):
    print('starting main')
    parses = [
        parse(url)
        for url in sites
    ]
    print('waiting for parses to complete')
    completed, pending = await asyncio.wait(parses)

    results = [t.result() for t in completed]
    print('results: {!r}'.format(results))


event_loop = asyncio.get_event_loop()
try:
    websites = ['site1', 'site2', 'site3']
    event_loop.run_until_complete(main(websites))
finally:
    event_loop.close()
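If you want results back in the same order as the input urls (rather than the completion order that asyncio.wait gives), `asyncio.gather` is an alternative. A minimal sketch, with a mocked coroutine in place of the real fetch logic and using `asyncio.run` (Python 3.7+) instead of the manual event loop:

```python
import asyncio


async def parse(url):
    # Stand-in for the real fetch logic
    await asyncio.sleep(0.01)
    return 'result from {}'.format(url)


async def main(sites):
    # gather runs the coroutines concurrently and preserves input order
    return await asyncio.gather(*(parse(url) for url in sites))


results = asyncio.run(main(['site1', 'site2', 'site3']))
print(results)
```

Note that plain `requests` calls are blocking and would defeat asyncio; the real fetch logic needs an async HTTP client or `run_in_executor`.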

Answer 2 (score: -1)

I think it is a good idea to use multithreading (threading) or multiprocessing, or you can use grequests (asynchronous requests built on gevent).