Question

I am using Scrapy 0.20 with Python 2.7.

I want to avoid duplicate items.

I don't want to pass JOBDIR as a command-line argument. Instead, I set it in my script like this:

settings.overrides['JOBDIR']= 'my customer jobdir'

Then, in my settings, I do this:

DUPEFILTER_CLASS = 'MyProject.CustomFilter.CustomFilter'

In CustomFilter I use this:

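import os  # needed for os.linesep

def request_seen(self, request):
    # Derive a custom ID from the URL; __getid is the author's own helper.
    fp = self.__getid(request.url)
    if fp is None:
        # No ID could be extracted, so never treat the request as a duplicate.
        return False
    if fp in self.fingerprints:
        return True
    # First time this ID is seen: remember it and persist it if a file is open.
    self.fingerprints.add(fp)
    if self.file:
        self.file.write(fp + os.linesep)
    return False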

where __getid is a helper function of mine.
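The helper itself is not shown in the question. As a rough illustration only, a helper of this kind might pull the numeric listing ID out of the URL and return None for pages that are not item pages. The function name get_id and the URL pattern below are assumptions based on the single URL that appears in the log message, not the author's actual code.

```python
import re

# Assumed pattern: the ID is the run of digits that starts the last path
# segment of an item page, e.g. ".../1057362-some-title.html".
_ID_PATTERN = re.compile(r"/(\d+)-[^/]*\.html$")


def get_id(url):
    """Return the listing ID found in url, or None if url is not an item page."""
    match = _ID_PATTERN.search(url)
    return match.group(1) if match else None
```

With a helper like this, request_seen only deduplicates item pages; listing and category pages yield None and are never reported as seen.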

My problem

The spider stops working when it finds the first duplicate item.

I found this message in CMD:

2014-03-03 10:43:44-0800 [GeneralSpider] DEBUG: Filtered duplicate request: <GET http://www.justproperty.com/apartments/old-town/1057362-most-affordable-2-b-r-in-old-town-for-sale.html> - no more duplicates will be shown (see DUPEFILTER_CLASS)

Answer 1

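You can pass dont_filter=True when creating the request. This tells Scrapy not to filter that request out as a duplicate. It is covered in the Scrapy documentation for Request objects.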
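To make the effect of dont_filter concrete, here is a small stand-alone model of the check the scheduler performs before queuing a request. It is a simplified sketch, not Scrapy's real code: FakeRequest, SeenUrls, and enqueue are hypothetical names, and the example.com URL is a placeholder.

```python
class FakeRequest(object):
    """Hypothetical stand-in for a Scrapy Request: only url and dont_filter."""

    def __init__(self, url, dont_filter=False):
        self.url = url
        self.dont_filter = dont_filter


class SeenUrls(object):
    """Toy duplicate filter keyed on the raw URL."""

    def __init__(self):
        self.seen = set()

    def request_seen(self, request):
        if request.url in self.seen:
            return True
        self.seen.add(request.url)
        return False


def enqueue(request, dupefilter, queue):
    """Queue the request unless the filter rejects it; return True if queued."""
    # The filter is consulted only when dont_filter is False.
    if not request.dont_filter and dupefilter.request_seen(request):
        return False
    queue.append(request)
    return True


def demo():
    dupefilter, queue = SeenUrls(), []
    url = "http://example.com/item/1"
    results = [
        enqueue(FakeRequest(url), dupefilter, queue),  # first visit
        enqueue(FakeRequest(url), dupefilter, queue),  # duplicate, dropped
        enqueue(FakeRequest(url, dont_filter=True), dupefilter, queue),  # bypass
    ]
    return results, len(queue)
```

Note that a request sent with dont_filter=True skips the filter entirely, so in this model it is also not recorded as seen.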

Answer 2

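Set DUPEFILTER_DEBUG = True in settings.py.

Now, the duplicate filter in the scheduler filters out every URL already seen during a single spider run, which means it is reset on subsequent runs.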

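If you want to keep crawling, ignore the duplicate URLs. The IgnoreVistedItems middleware keeps state between runs and avoids visiting URLs seen in the past, but only for final item URLs, so the rest of the site can still be re-crawled to find new items. Hope this helps someone.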
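The difference between per-run filtering and state kept between runs can be sketched with a set of item IDs backed by a text file. This is only an illustration of the idea, not the middleware's implementation: PersistentSeen, the file name seen.txt, and the second ID are made up for the example.

```python
import os
import tempfile


class PersistentSeen(object):
    """Set of item IDs backed by a text file, one ID per line (hypothetical)."""

    def __init__(self, path):
        self.path = path
        self.ids = set()
        # Reload whatever a previous run recorded.
        if os.path.exists(path):
            with open(path) as handle:
                self.ids.update(line.strip() for line in handle if line.strip())

    def seen(self, item_id):
        """Return True if item_id was recorded before; otherwise record it."""
        if item_id in self.ids:
            return True
        self.ids.add(item_id)
        with open(self.path, "a") as handle:
            handle.write(item_id + "\n")
        return False


def demo():
    path = os.path.join(tempfile.mkdtemp(), "seen.txt")
    first_run = PersistentSeen(path)
    a = first_run.seen("1057362")      # new during the first run
    second_run = PersistentSeen(path)  # simulates restarting the spider
    b = second_run.seen("1057362")     # remembered from the first run
    c = second_run.seen("2000001")     # a new item is still accepted
    return a, b, c
```

An in-memory set alone would forget everything at restart; the file is what carries the state from one run to the next.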
