Error 302 when downloading files in Scrapy

Asked: 2016-05-21 21:07:31

Tags: python scrapy

Why am I getting this error?

files:
  "/etc/security/limits.conf":
    content: |
      *           soft    nofile          6144
      *           hard    nofile          6144
container_commands:
    01-worker-connections:
        command: "/bin/sed -i 's/worker_connections  1024/worker_connections  6144/g' /tmp/deployment/config/#etc#nginx#nginx.conf"

The URL looks fine in my browser, and a 302 is just a redirect. Why doesn't Scrapy simply follow the redirect and download the file?

[scrapy] WARNING: File (code: 302): Error downloading file from <GET <url>> referred in <None>

3 Answers:

Answer 0 (score: 4)

My workaround is to first send an HTTP HEAD request with the requests library, choose the URL to download based on the status_code, and then put that URL into file_urls (or a custom field).

import requests

def check_redirect(url):
    # HEAD request; requests.head() does not follow redirects by default,
    # so a 302 is visible here and we can read its Location header.
    response = requests.head(url, allow_redirects=False)
    if response.status_code == 302:
        url = response.headers["Location"]
    return url
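The same idea can be factored into a pure helper that separates the redirect logic from the network call, which makes it easy to unit-test without hitting a server. `resolve_redirect` is a hypothetical name, not part of requests or Scrapy, and the URLs are placeholders:

```python
def resolve_redirect(status_code, headers, url):
    # Mirror check_redirect's logic on already-fetched response data:
    # on a 302, follow the Location header; otherwise keep the URL.
    if status_code == 302:
        return headers.get("Location", url)
    return url

# Example with made-up values:
final_url = resolve_redirect(
    302,
    {"Location": "http://example.com/real-file.pdf"},
    "http://example.com/download?id=1",
)
```

The network call (requests.head) then only supplies `status_code` and `headers`, and the decision itself stays testable.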

Or you can use a custom files pipeline:

import requests
import scrapy
from scrapy.exceptions import DropItem
from scrapy.pipelines.files import FilesPipeline


class MyFilesPipeline(FilesPipeline):

    def handle_redirect(self, file_url):
        # Resolve the 302 manually so the pipeline requests the final URL.
        response = requests.head(file_url)
        if response.status_code == 302:
            file_url = response.headers["Location"]
        return file_url

    def get_media_requests(self, item, info):
        redirect_url = self.handle_redirect(item["file_urls"][0])
        yield scrapy.Request(redirect_url)

    def item_completed(self, results, item, info):
        file_paths = [x['path'] for ok, x in results if ok]
        if not file_paths:
            raise DropItem("Item contains no files")
        item['file_urls'] = file_paths
        return item
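For a custom pipeline like this to run, it has to be registered in settings.py. A minimal sketch, where the module path and storage directory are assumptions for illustration:

```python
# settings.py -- 'myproject.pipelines' is a hypothetical module path
ITEM_PIPELINES = {
    'myproject.pipelines.MyFilesPipeline': 1,
}
FILES_STORE = '/path/to/downloaded/files'  # where the pipeline saves files
```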

I used another solution here: Scrapy i/o block when downloading files

Answer 1 (score: 3)

The root of the problem seems to be this code in pipelines/media.py:

    def _check_media_to_download(self, result, request, info):
        if result is not None:
            return result
        if self.download_func:
            # this ugly code was left only to support tests. TODO: remove
            dfd = mustbe_deferred(self.download_func, request, info.spider)
            dfd.addCallbacks(
                callback=self.media_downloaded, callbackArgs=(request, info),
                errback=self.media_failed, errbackArgs=(request, info))
        else:
            request.meta['handle_httpstatus_all'] = True
            dfd = self.crawler.engine.download(request, info.spider)
            dfd.addCallbacks(
                callback=self.media_downloaded, callbackArgs=(request, info),
                errback=self.media_failed, errbackArgs=(request, info))
        return dfd

Specifically, the line that sets handle_httpstatus_all to True disables the downloader's redirect middleware, which triggers the error. I will ask about the reason on the Scrapy GitHub.

Answer 2 (score: 1)

If the problem is the redirect, you should add the following to settings.py:

MEDIA_ALLOW_REDIRECTS = True

Source: Allowing redirections in Scrapy