Scrapy - Avoiding duplicate items when recursively crawling multiple pages

Date: 2019-02-23 14:30:03

Tags: python scrapy web-crawler

What changes should I make to my code to prevent Scrapy from retrieving the same items during a deep crawl across multiple pages?

Right now, Scrapy crawls and scrapes like this:

Visit Page-A >> ScrapeItem1 & Extract_link_to_Page-B >> Visit Page-B >> ScrapeItem2 & Extract_links_to_Pages-C-D-E >> ScrapeItems2-3-4-5 from Pages-C-D-E

The code looks like this:

    def category_page(self, response):
        next_page = response.xpath('')  # keep the selector; extract_first() is called below

        for item in self.parse_attr(response):
            yield item

        if next_page:
            path = next_page.extract_first()
            nextpage = response.urljoin(path)
            yield scrapy.Request(nextpage, callback=self.category_page)

    def parse_attr(self, response):
        item = TradeItem()
        item['NameOfCompany'] = response.xpath('').extract_first().strip()
        item['Country'] = response.xpath('').extract_first().strip()
        item['TrustPt'] = response.xpath('').extract_first().strip()
        company_page = response.xpath('').extract_first()

        if company_page:
            company_page = response.urljoin(company_page)
            request = scrapy.Request(company_page, callback=self.company_data)
            request.meta['item'] = item
            yield request
        else:
            yield item

    def company_data(self, response):
        item = response.meta['item']
        item['Address'] = response.xpath('').extract()[1]
        product_page = response.xpath('').extract()[1]
        sell_page = response.xpath('').extract()[2]
        trust_page = response.xpath('').extract()[4]

        if sell_page:
            sell_page = response.urljoin(sell_page)
            request = scrapy.Request(sell_page, callback=self.sell_data)
            request.meta['item3'] = item
            yield request
        if product_page:
            product_page = response.urljoin(product_page)
            request = scrapy.Request(product_page, callback=self.product_data)
            request.meta['item2'] = item
            yield request
        if trust_page:
            trust_page = response.urljoin(trust_page)
            request = scrapy.Request(trust_page, callback=self.trust_data)
            request.meta['item4'] = item
            yield request

        yield item

    def product_data(self, response):
        item = response.meta['item2']
        item['SoldProducts'] = response.xpath('').extract()
        yield item

    def sell_data(self, response):
        item = response.meta['item3']
        item['SellOffers'] = response.xpath('').extract()
        yield item

    def trust_data(self, response):
        item = response.meta['item4']
        item['TrustData'] = response.xpath('').extract()
        yield item

The problem is duplicate items: Scrapy performs a PARTIAL scrape at each function / meta item, so I get entries like these:

Step 1:

{'Address': u'',
 'Country': u'',
 'NameOfCompany': u'',
 'TrustPoints': u''}

Step 2:

{'Address': u'',
 'Country': u'',
 'NameOfCompany': u'',
 'SellOffers': [],
 'TrustPoints': u''}

Step 3:

{'Address': u'',
 'Country': u'',
 'NameOfCompany': u'',
 'SellOffers': [],
 'SoldProducts': [u' '],
 'TrustData': [u''],
 'TrustPoints': u''}

Each STEP repeats the values of the previous one. I know this is caused by Scrapy visiting the URLs multiple times. There is some mistake in my logic that I can't fully grasp.
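To make the failure mode concrete, my structure reduces to something like this minimal sketch (the names and URLs are placeholders, not my real selectors): the same item object is yielded from more than one callback, so the pipeline receives it once per callback, each time only partially filled.

    import scrapy

    class PartialItemsSpider(scrapy.Spider):
        # Hypothetical spider reproducing the duplicate / partial-item behaviour.
        name = 'partial_items_example'
        start_urls = ['https://example.com/company']  # placeholder URL

        def parse(self, response):
            item = {'NameOfCompany': 'Acme'}
            yield item  # output #1: the item, still partial
            yield scrapy.Request(response.urljoin('/products'),  # placeholder path
                                 callback=self.parse_products,
                                 meta={'item': item})

        def parse_products(self, response):
            item = response.meta['item']
            item['SoldProducts'] = ['widget']
            yield item  # output #2: the SAME dict again, now with more fields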

1 Answer:

Answer 0 (score: 0):

Problem solved.

The corresponding answer:

https://stackoverflow.com/a/16177544/11008259

The corrected code for my case:

    def parse_attr(self, response):
        company_page = response.xpath('').extract_first()

        company_page = response.urljoin(company_page)
        request = scrapy.Request(company_page, callback=self.company_data)
        yield request

    def company_data(self, response):
        item = TradekeyItem()
        item['Address'] = response.xpath('').extract()[1]
        item['NameOfCompany'] = response.xpath('').extract()[1]

        product_page = response.xpath('').extract()[1]

        product_page = response.urljoin(product_page)
        # the meta= argument already attaches the item to the request
        request = scrapy.Request(product_page, callback=self.product_data, meta={'item': item})
        return request

    def product_data(self, response):
        item = response.meta['item']
        item['SoldProducts'] = response.xpath('').extract()
        sell_page = response.xpath('').extract()[2]
        sell_page = response.urljoin(sell_page)
        request = scrapy.Request(sell_page, callback=self.sell_data, meta={'item': item})
        return request

    def sell_data(self, response):
        item = response.meta['item']
        item['SellOffers'] = response.xpath('').extract()
        trust_page = response.xpath('').extract()[4]
        trust_page = response.urljoin(trust_page)
        request = scrapy.Request(trust_page, callback=self.trust_data, meta={'item': item})
        return request

    def trust_data(self, response):
        item = response.meta['item']
        item['TrustData'] = response.xpath('').extract()
        yield item

We build a chain between the items by not yielding the item at every step, but only at the last one. Each function returns its request to the next, so the item is output only after all of the functions have finished running.
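For reference, here is a minimal self-contained sketch of this chaining pattern (the spider name, URLs, and field values are hypothetical, not taken from the original site). It passes the item with cb_kwargs, which Scrapy 1.7+ offers as a cleaner alternative to request.meta:

    import scrapy

    class ChainedSpider(scrapy.Spider):
        # Hypothetical spider: yields exactly one fully populated item per chain.
        name = 'chained_example'
        start_urls = ['https://example.com/company']  # placeholder URL

        def parse(self, response):
            item = {'NameOfCompany': 'Acme'}  # first fields collected here
            # Do NOT yield the item yet; hand it to the next callback instead.
            yield response.follow('/products', callback=self.parse_products,
                                  cb_kwargs={'item': item})

        def parse_products(self, response, item):
            item['SoldProducts'] = ['widget']  # fields from the second page
            yield response.follow('/trust', callback=self.parse_trust,
                                  cb_kwargs={'item': item})

        def parse_trust(self, response, item):
            item['TrustData'] = ['verified']  # last page in the chain
            yield item  # the single, complete item

One tradeoff to be aware of: the chain fetches its pages sequentially, and if any request in it fails, the partially built item is never yielded, so an errback may be worth adding for long chains.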