Scrapy - how to get the referer of a duplicate request

Date: 2016-09-21 02:36:23

Tags: python python-2.7 web-scraping scrapy scrapy-settings

When I run my spider, I get:

2016-09-21 01:48:29 [scrapy] DEBUG: Filtered duplicate request: <GET http://www.example.org/example.html>

The problem is that I need to know the referer of the duplicate request in order to debug my code. How can I find out the referer?

1 Answer:

Answer 0 (score: 1)

One option is a custom filter based on the built-in RFPDupeFilter:

from scrapy.dupefilters import RFPDupeFilter

class MyDupeFilter(RFPDupeFilter):
    def log(self, request, spider):
        # Log the Referer header of the filtered (duplicate) request.
        # Scrapy's Headers mapping is case-insensitive; 'Referer' is the
        # conventional key. get() returns None if the header is absent.
        referer = request.headers.get('Referer')
        self.logger.debug("Duplicate request referer: %(referer)s",
                          {'referer': referer}, extra={'spider': spider})
        super(MyDupeFilter, self).log(request, spider)

Don't forget to point the DUPEFILTER_CLASS setting at your custom class.
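For example, assuming the filter above is saved as dupefilters.py inside a project package named myproject (both names are hypothetical), the setting would be:

```python
# settings.py
# "myproject.dupefilters" is an assumed module path; adjust it to wherever
# you actually placed MyDupeFilter in your project.
DUPEFILTER_CLASS = 'myproject.dupefilters.MyDupeFilter'
```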

(未经测试)
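As background on why the filter can detect a duplicate from the request alone: Scrapy's RFPDupeFilter compares request fingerprints (hashes derived from the request), so the referer is not part of the duplicate decision and has to be logged separately, as above. Here is a minimal standalone sketch of that fingerprint idea in plain Python (no Scrapy; this is an illustration, not Scrapy's actual implementation):

```python
import hashlib

def fingerprint(method, url, body=b""):
    # Simplified sketch: hash the parts of the request that identify it.
    # Note the referer header is deliberately NOT included.
    h = hashlib.sha1()
    h.update(method.encode("utf-8"))
    h.update(url.encode("utf-8"))
    h.update(body)
    return h.hexdigest()

seen = set()

def is_duplicate(method, url, referer=None):
    # Returns True if an equivalent request was already seen.
    fp = fingerprint(method, url)
    if fp in seen:
        # This is the point where logging the referer helps debugging.
        print("duplicate of %s (referer: %s)" % (url, referer))
        return True
    seen.add(fp)
    return False
```

Because the fingerprint ignores the referer, two requests for the same URL reached from different pages collapse into one entry, which is exactly why the stock log message cannot tell you where the duplicate came from.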