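How can I download Scrapy images into a dynamic folder?

Time: 2014-12-09 18:46:38

Tags: python scrapy

I can download images with Scrapy into the "full" folder, but I need the name of the destination folder to be dynamic on every Scrapy run, for example full/session_id.

Is there any way to do this?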

2 answers:

Answer 0 (score: 2)

I haven't used ImagesPipeline yet, but following the documentation, I would override item_completed(results, item, info).

The original definition is:

def item_completed(self, results, item, info):
    if self.IMAGES_RESULT_FIELD in item.fields:
        item[self.IMAGES_RESULT_FIELD] = [x for ok, x in results if ok]
    return item

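This should give you the result set of the downloaded images, including their paths (it seems there can be many images on one item).

If you override this method in a subclass so that it moves all files before setting the path, it should work the way you want. You can set the target folder on the item, for example as item['session_path']. You have to set this on every item before returning/yielding it from your spider.

A subclass with the overridden method could look like this: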

import os
# older Scrapy releases exposed this module as scrapy.contrib.pipeline.images
from scrapy.pipelines.images import ImageException, ImagesPipeline

class SessionImagesPipeline(ImagesPipeline):
    def item_completed(self, results, item, info):
        moved = []
        # iterate over the results of all successfully downloaded images
        for result in [x for ok, x in results if ok]:
            # result['path'] is relative to IMAGES_STORE;
            # self.store.basedir assumes the default filesystem store
            path = os.path.join(self.store.basedir, result['path'])
            # here we create the session path where the files should be in the end
            # you'll have to change this path creation depending on your needs
            target_path = os.path.join(item['session_path'], os.path.basename(path))

            # try to move the file and raise an exception if that is not possible
            # (os.rename returns None on success and raises OSError on failure)
            try:
                os.makedirs(item['session_path'], exist_ok=True)
                os.rename(path, target_path)
            except OSError:
                raise ImageException("Could not move image to target folder")

            result['path'] = target_path
            moved.append(result)

        # here we write out the results with the new paths,
        # if there is a result field on the item (just like the original code does)
        if self.IMAGES_RESULT_FIELD in item.fields:
            item[self.IMAGES_RESULT_FIELD] = moved

        return item
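
The move step can also be tried in isolation, without Scrapy. The following is a minimal sketch that assumes the images are stored on the local filesystem; the function name move_to_session_dir and its parameters are hypothetical and not part of Scrapy.

```python
import os
import shutil


def move_to_session_dir(store_dir, relative_path, session_dir):
    """Move one downloaded file into session_dir and return its new path.

    store_dir     -- base directory of the image store (assumed local filesystem)
    relative_path -- path as reported in result['path'], relative to store_dir
    session_dir   -- hypothetical per-run target folder, e.g. item['session_path']
    """
    source = os.path.join(store_dir, relative_path)
    # create the session folder the first time it is needed
    os.makedirs(session_dir, exist_ok=True)
    target = os.path.join(session_dir, os.path.basename(relative_path))
    # shutil.move raises an exception if the file cannot be moved
    shutil.move(source, target)
    return target
```

Unlike os.rename, shutil.move also works when the session folder is on a different filesystem than the image store.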

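An even nicer approach would be to set the desired session path not on the item but in the configuration for the Scrapy run. For that you would have to find out how to set the configuration while the application is running; I think you would have to override the constructor.

Answer 1 (score: 0)

Here is an answer from stackoverflow.com: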
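
One possible way to realise the configuration-based idea is to build a session-specific storage root once per run. This is only a sketch: the helper names new_session_id and session_store and the identifier format are assumptions made for illustration.

```python
import os
import uuid
from datetime import datetime


def new_session_id(now=None):
    """Return a hypothetical per-run identifier: timestamp plus a short random suffix."""
    now = now or datetime.now()
    return "%s_%s" % (now.strftime("%Y%m%d-%H%M%S"), uuid.uuid4().hex[:8])


def session_store(base_dir, session_id):
    """Build the per-session storage root, e.g. images/<session_id>."""
    return os.path.join(base_dir, session_id)
```

The resulting path could then be supplied as the IMAGES_STORE setting for that run, for example with the -s option of scrapy crawl. Note that the default pipeline would then write to <session root>/full/ rather than full/<session_id>.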

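import hashlib
from scrapy.pipelines.images import ImagesPipeline
from scrapy.utils.python import to_bytes

class StoreImgPipeline(ImagesPipeline):
    def file_path(self, request, response=None, info=None):
        # YEAR is not defined in this snippet; it must be defined elsewhere
        image_guid = hashlib.sha1(to_bytes(request.url)).hexdigest()
        return 'realty-sc/%s/%s/%s/%s.jpg' % (YEAR, image_guid[:2], image_guid[2:4], image_guid)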
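
The same technique can be adapted to the full/session_id layout from the question. The sketch below shows only the path construction as a plain function so that it runs without Scrapy; the name session_file_path and the session_id argument are hypothetical.

```python
import hashlib


def session_file_path(url, session_id):
    """Return a store-relative path like full/<session_id>/<sha1-of-url>.jpg."""
    image_guid = hashlib.sha1(url.encode("utf-8")).hexdigest()
    return "full/%s/%s.jpg" % (session_id, image_guid)
```

In a pipeline subclass, file_path could return session_file_path(request.url, session_id), with the session id taken from wherever the run stores it, for example a setting or a field on the item.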