This is my custom_filters.py file:
from scrapy.dupefilter import RFPDupeFilter

class SeenURLFilter(RFPDupeFilter):
    def __init__(self, path=None):
        self.urls_seen = set()
        RFPDupeFilter.__init__(self, path)

    def request_seen(self, request):
        if request.url in self.urls_seen:
            return True
        else:
            self.urls_seen.add(request.url)
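The set-based dedup logic in request_seen can be exercised without Scrapy. A minimal stand-alone sketch (the class name SeenURLSet and the example URLs are illustrative, not from the original post; unlike the filter above, it returns False explicitly on a first sighting):

```python
class SeenURLSet:
    """Plain-Python stand-in for the set-based request dedup above."""

    def __init__(self):
        self.urls_seen = set()

    def request_seen(self, url):
        # True the second time a URL is offered, False the first time.
        if url in self.urls_seen:
            return True
        self.urls_seen.add(url)
        return False

f = SeenURLSet()
print(f.request_seen("http://example.com/a"))  # → False (first sighting)
print(f.request_seen("http://example.com/a"))  # → True  (duplicate)
```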
I added this line to settings.py:

DUPEFILTER_CLASS = 'crawl_website.custom_filters.SeenURLFilter'
But when I check the generated CSV file, the same URL still shows up multiple times. Am I doing something wrong?
Answer (score: 2)
From: http://doc.scrapy.org/en/latest/topics/item-pipeline.html#duplicates-filter

The duplicate filter only deduplicates requests before they are scheduled; duplicate rows in the CSV come from duplicate items, which you drop in an item pipeline instead:
from scrapy.exceptions import DropItem

class DuplicatesPipeline(object):
    def __init__(self):
        self.ids_seen = set()

    def process_item(self, item, spider):
        if item['id'] in self.ids_seen:
            raise DropItem("Duplicate item found: %s" % item)
        else:
            self.ids_seen.add(item['id'])
            return item
Then in settings.py, add:
ITEM_PIPELINES = {
    'your_bot_name.pipelines.DuplicatesPipeline': 100
}
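The integer is the pipeline's order in the item pipeline chain: lower values run first. If the project has other pipelines, they go in the same dict. A sketch, where CsvExportPipeline is a hypothetical second pipeline, not something from the post:

```python
ITEM_PIPELINES = {
    # Dedup runs early (order 100) so later pipelines never see duplicates.
    'your_bot_name.pipelines.DuplicatesPipeline': 100,
    # Hypothetical export pipeline; order 300 means it runs after dedup.
    'your_bot_name.pipelines.CsvExportPipeline': 300,
}
```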
Edit:
To check for duplicate URLs instead, use:
from scrapy.exceptions import DropItem

class DuplicatesPipeline(object):
    def __init__(self):
        self.urls_seen = set()

    def process_item(self, item, spider):
        if item['url'] in self.urls_seen:
            raise DropItem("Duplicate item found: %s" % item)
        else:
            self.urls_seen.add(item['url'])
            return item
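The pipeline's drop behavior can be simulated without a running crawler. A plain-Python sketch, where DropItem is a stand-in for scrapy.exceptions.DropItem, items are plain dicts, and the class and URL names are illustrative:

```python
class DropItem(Exception):
    """Stand-in for scrapy.exceptions.DropItem."""

class UrlDedupPipeline:
    def __init__(self):
        self.urls_seen = set()

    def process_item(self, item, spider=None):
        if item['url'] in self.urls_seen:
            raise DropItem("Duplicate item found: %s" % item)
        self.urls_seen.add(item['url'])
        return item

pipeline = UrlDedupPipeline()
items = [
    {'url': 'http://example.com/a'},
    {'url': 'http://example.com/a'},  # duplicate, will be dropped
    {'url': 'http://example.com/b'},
]
kept = []
for it in items:
    try:
        kept.append(pipeline.process_item(it))
    except DropItem:
        pass  # Scrapy would log and discard the item here.
print([i['url'] for i in kept])  # → ['http://example.com/a', 'http://example.com/b']
```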
This requires a url = Field() on your item. Something like this (items.py):
from scrapy.item import Item, Field

class PageItem(Item):
    url = Field()
    scraped_field_a = Field()
    scraped_field_b = Field()