Question

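I am using Anaconda with Python 3.5.2.

I have a list of 280,000 URLs. I fetch the data from each URL and try to keep track of which URL each piece of data came from.

I have made about 30K requests so far, averaging 1 request per second.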

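import pandas as pd
import requests
from pandas.io.json import json_normalize  # pd.json_normalize on pandas >= 1.0

response_df = pd.DataFrame()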
# create the session
with requests.Session() as s:
    # loop through the list of urls
    for url in url_list:
        # call the resource
        resp = s.get(url)
        # check the response
        if resp.status_code == requests.codes.ok:
            # create a new dataframe with the response            
            ftest = json_normalize(resp.json())
            ftest['url'] = url
            response_df = response_df.append(ftest, ignore_index=True)
        else:
            print("Something went wrong! Hide your wife! Hide the kids!")

response_df.to_csv(results_csv)

Answer 1

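I ended up giving up on requests and used asyncio and aiohttp instead. With requests I was pulling about 1 URL per second. The new approach averages about 5 per second while using only about 20% of my system resources. I ended up using something very similar to this: https://www.blog.pythonlibrary.org/2016/07/26/python-3-an-intro-to-asyncio/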

import aiohttp
import asyncio
import async_timeout
import os

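async def download_coroutine(session, url):
    # Abort the download if it takes longer than 10 seconds.
    # Recent async_timeout releases require "async with" here.
    async with async_timeout.timeout(10):
        async with session.get(url) as response:
            # Save the file under the last path component of the URL.
            filename = os.path.basename(url)
            with open(filename, 'wb') as f_handle:
                # Stream the body to disk in 1 KB chunks.
                while True:
                    chunk = await response.content.read(1024)
                    if not chunk:
                        break
                    f_handle.write(chunk)
            return await response.release()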

async def main(loop):
    urls = ["http://www.irs.gov/pub/irs-pdf/f1040.pdf",
        "http://www.irs.gov/pub/irs-pdf/f1040a.pdf",
        "http://www.irs.gov/pub/irs-pdf/f1040ez.pdf",
        "http://www.irs.gov/pub/irs-pdf/f1040es.pdf",
        "http://www.irs.gov/pub/irs-pdf/f1040sb.pdf"]

    # This block belongs inside main(); each download is awaited in turn.
    async with aiohttp.ClientSession(loop=loop) as session:
        for url in urls:
            await download_coroutine(session, url)


if __name__ == '__main__':
    loop = asyncio.get_event_loop()
    loop.run_until_complete(main(loop))

Also, these were helpful:
https://snarky.ca/how-the-heck-does-async-await-work-in-python-3-5/
http://www.pythonsandbarracudas.com/blog/2015/11/22/developing-a-computational-pipeline-using-the-asyncio-module-in-python-3
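
The loop in the code above awaits each download before starting the next one, so the requests still run one at a time. Below is a minimal sketch, not the code I actually ran, of how the requests might be overlapped while keeping each URL paired with its data. The names fetch_json, fetch_all and crawl and the limit of 10 are assumptions made for illustration. It also assumes every URL returns JSON, as in the question.

```python
import asyncio


async def fetch_json(session, url):
    """Fetch one URL and decode its JSON body (expects an aiohttp session)."""
    async with session.get(url) as resp:
        resp.raise_for_status()  # raise on 4xx/5xx responses
        return await resp.json()


async def fetch_all(urls, fetch, limit=10):
    """Call fetch(url) for every URL with at most `limit` calls in flight.

    Returns (url, result) pairs in input order. A failed call yields
    (url, exception), so one bad URL does not cancel the others.
    """
    semaphore = asyncio.Semaphore(limit)

    async def bounded(url):
        async with semaphore:
            try:
                return url, await fetch(url)
            except Exception as exc:
                return url, exc

    return await asyncio.gather(*(bounded(url) for url in urls))


async def crawl(urls, limit=10):
    """Hypothetical entry point that wires fetch_all to aiohttp."""
    import aiohttp  # imported here so fetch_all stays usable without it

    async with aiohttp.ClientSession() as session:
        return await fetch_all(urls, lambda u: fetch_json(session, u), limit)
```

On Python 3.7+ this could be run with asyncio.run(crawl(url_list)); on 3.5 the equivalent is loop.run_until_complete(crawl(url_list)). Because every result comes back as a (url, data) pair, the URL-to-data tracking from the question is preserved, and the frames can be built once at the end instead of appending inside the loop. The limit is worth tuning against what the server tolerates.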

How can I improve the speed of this Python requests session?
