Downloading multiple files with gevent

Time: 2014-05-23 07:20:53

Tags: python asynchronous download urllib2 gevent

I am trying to download a list of files in parallel using [gevent][1].

My code is a slight modification of the code suggested here:

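# Imports the snippet relies on (not shown in the original post).
from gevent import monkey
monkey.patch_all()  # patch the standard library as early as possible

import os
import urllib2

import gevent
from gevent.fileobject import FileObject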

def download_xbrl_files(download_folder, yq, list_of_xbrl_urls):
    def download_and_save_file(url, yr, qtr):
        if url is not None:
            full_url = "http://edgar.sec.gov" + url
            if not os.path.exists(full_url):
                try:
                    content = urllib2.urlopen(full_url).read()
                    filename = download_folder + "/" + str(y) + "/" + q + "/" + url.split('/')[-1]
                    print "Saving: ", filename
                    f_raw = open(filename, "w")
                    f = FileObject(f_raw, "w")
                    try:
                        f.write(content)
                    finally:
                        f.close()
                        return 'Done'
                except:
                    print "Warning: can't save or access for item:", url
                    return None
            else:
                return 'Exists'
        else:
            return None
    (y, q) = yq
    if utls.has_elements(list_of_xbrl_urls):
        filter_for_none = filter(lambda x: x is not None, list_of_xbrl_urls)
        no_duplicates = list(set(filter_for_none))
        download_files = [gevent.spawn(lambda x: download_and_save_file(x, y, q), x) for x in no_duplicates]
        gevent.joinall(download_files)
        return 'completed'
    else:
        return 'empty'

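What the code does:

  1. After some cleaning,
  2. gevent.spawn spawns download_and_save_file, which:
  3. checks whether the file has already been downloaded;
  4. if not, downloads the content using urllib2.urlopen(full_url).read();
  5. saves the file with the help of gevent's FileObject.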

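My impression is that download_and_save_file only runs sequentially. In addition, my application ends up sitting idle. I could add a timeout, but I would like to handle failures gracefully in my code.

I am wondering what I am doing wrong; this is my first time writing code in Python.

EDIT

Here is a version of the code that uses threads:
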
    def download_xbrl_files(download_folder, yq_and_url):
        (yq, url) = yq_and_url
        (yr, qtr) = yq
        if url is not None and url != '':
            full_url = "http://edgar.sec.gov" + url
            filename = download_folder + "/" + str(yr) + "/" + qtr + "/" + url.split('/')[-1]
            if not os.path.exists(filename):
                try:
                    content = urllib2.urlopen(full_url).read()
                    print "Saving: ", filename
                    f = open(filename, "wb")
                    try:
                        f.write(content)
                        print "Writing done: ", filename
                    finally:
                        f.close()
                        return 'Done'
                except:
                    print "Warning: can't save or access for item:", url
                    return None
            else:
                print "Exists: ", filename
                return 'Exists'
        else:
            return None
    
    
    def download_filings(download_folder, yq_and_filings):
        threads = [threading.Thread(target=download_xbrl_files, args=(download_folder, x,)) for x in yq_and_filings]
        [thread.start() for thread in threads]
        [thread.join() for thread in threads]
    

1 Answer:

Answer 0: (score: 1)

I looked into this more deeply: gevent.spawn() creates greenlets, not processes (all greenlets run in a single OS thread).

Try something simple:

import gevent
from time import sleep
g = [gevent.spawn(sleep, 1) for x in range(100)]
gevent.joinall(g)

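You will see that this takes about 100 seconds, because the plain time.sleep call blocks the single OS thread that all of the greenlets share. This demonstrates the point above.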
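One caveat about the experiment above: time.sleep is a blocking call that never hands control back to the scheduler, so the 100 greenlets have no chance to interleave. Replacing it with gevent.sleep, or calling monkey.patch_all() before importing sleep, should let the waits overlap. The sketch below illustrates the same effect using only the standard library, with asyncio tasks standing in for greenlets (a swapped-in technique, chosen so the example runs without gevent installed). It assumes Python 3, and the names blocking_task, cooperative_task, run_all and compare are made up for this illustration.

```python
import asyncio
import time


async def blocking_task(delay):
    # time.sleep never yields to the event loop, so tasks run one after another.
    time.sleep(delay)


async def cooperative_task(delay):
    # asyncio.sleep yields control while waiting, so the waits overlap.
    await asyncio.sleep(delay)


async def run_all(task, count, delay):
    """Run `count` copies of `task` in one thread and return the elapsed seconds."""
    start = time.perf_counter()
    await asyncio.gather(*(task(delay) for _ in range(count)))
    return time.perf_counter() - start


def compare(count=5, delay=0.1):
    """Return (blocking_elapsed, cooperative_elapsed) for the two task styles."""
    blocking = asyncio.run(run_all(blocking_task, count, delay))
    cooperative = asyncio.run(run_all(cooperative_task, count, delay))
    return blocking, cooperative


if __name__ == "__main__":
    blocking, cooperative = compare()
    print("blocking: %.2fs, cooperative: %.2fs" % (blocking, cooperative))
```

The blocking variant should take roughly count * delay seconds, while the cooperative variant should take roughly delay seconds, even though both run in a single OS thread.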

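What you are really looking for is multithreading, which is available in the threading module. Have a look at the following question for a bit of a how-to: How to use threading in Python?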
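Applied to the download problem in the question, a thread-based version could look roughly like the sketch below. It is an illustration under several assumptions rather than a drop-in replacement: it targets Python 3 (so urllib.request.urlopen stands in for urllib2.urlopen), it uses concurrent.futures.ThreadPoolExecutor instead of raw threading.Thread objects to cap the number of simultaneous downloads, and the names fetch, download_one, download_all, the 30 second timeout and the worker count of 8 are all hypothetical choices.

```python
import os
from concurrent.futures import ThreadPoolExecutor, as_completed
from urllib.request import urlopen  # Python 3 counterpart of urllib2.urlopen


def fetch(url, timeout=30):
    """Return the response body for `url`; the timeout value is an arbitrary choice."""
    with urlopen(url, timeout=timeout) as response:
        return response.read()


def download_one(url, folder, fetcher=fetch):
    """Download one URL into `folder` and return a status string."""
    filename = os.path.join(folder, url.split("/")[-1])
    if os.path.exists(filename):
        return "Exists"
    content = fetcher(url)
    with open(filename, "wb") as f:
        f.write(content)
    return "Done"


def download_all(urls, folder, max_workers=8, fetcher=fetch):
    """Download `urls` concurrently; map each URL to its status or its error."""
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # set() drops duplicate URLs, as the question's code does.
        futures = {pool.submit(download_one, u, folder, fetcher): u for u in set(urls)}
        for future in as_completed(futures):
            url = futures[future]
            try:
                results[url] = future.result()
            except Exception as exc:  # one failed download must not stop the rest
                results[url] = "Failed: %s" % exc
    return results
```

Each failure is recorded next to its URL instead of being swallowed, so the caller can decide whether to retry or report it. The fetcher parameter exists only so the function can be exercised without network access.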

--- UPDATE ---

Here is a simple example of how to do this:

import threading
from time import sleep
threads = [threading.Thread(target=sleep, args=(1,)) for x in range(10)]
[thread.start() for thread in threads]
[thread.join() for thread in threads]
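To check that the sleeps in the thread example really overlap, the same pattern can be wrapped in a small timing helper. This is a minimal sketch assuming Python 3; the function name timed_threads and the shorter 0.1 second delay are choices made for this illustration only.

```python
import threading
import time


def timed_threads(count=10, delay=0.1):
    """Start `count` threads that each sleep for `delay`; return the elapsed seconds."""
    threads = [threading.Thread(target=time.sleep, args=(delay,)) for _ in range(count)]
    start = time.perf_counter()
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return time.perf_counter() - start


if __name__ == "__main__":
    print("10 threads sleeping 0.1s each took %.2fs" % timed_threads())
```

The elapsed time should be close to a single delay rather than count * delay, because time.sleep releases the interpreter lock while it waits.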