Question

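I have just started programming in Python. As my first project I want to build a web crawler with Scrapy, a Python module. I have been struggling with one problem for two days and cannot find a solution. Any help would be appreciated.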

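I want to crawl and scrape car price data from Allegro (the Polish eBay). The first stage of my project is to download the list of car brands with their subcategories (I want to go as deep into the subcategories as possible) and the number of offers.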

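I start crawling from this page: http://allegro.pl/osobowe-pozostale-4058, where I can click through the categories in the left panel. So far I am only interested in the data in the left panel.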

As a result I would like to get a JSON file with this structure:

[
    {"name": "BMW", # name
    "url": "http://allegro.pl/osobowe-bmw-4032", # link to subcategories
    "count": 12726, # number of offers
    "subcategories":[
        {  "name": "Seria 1", # name
        "url": "http://allegro.pl/bmw-seria-1-12435", # link to subcategories
        "count": 832 # number of offers
        }
        ,
        {another BMW model}
        ,
        …
        ]
     }
     ,
     {another car brand }
     ,
      …
]

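Because some brands have no subcategories and some have subcategories of subcategories, the web crawler must be very flexible: sometimes it should stop at the main page, and sometimes it should go deeper and stop at a dead-end subcategory.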

BMW -> Seria 1 -> E87 (2004-2013)    vs  Acura (only 2 offers and no subcategories)

So far I have been able to create a first spider, which looks like this:

Items.py

import scrapy
class Allegro3Item(scrapy.Item):
    name=scrapy.Field()
    count=scrapy.Field()
    url = scrapy.Field()
    subcategory= scrapy.Field()

Spider:

import scrapy

from allegro3.items import Allegro3Item

linki=[]

class AlegroSpider(scrapy.Spider):
    name = "AlegroSpider"
    allowed_domains = ["allegro.pl"]
    start_urls = ["http://allegro.pl/samochody-osobowe-4029"]

    def parse(self, response):

        global linki

        if response.url not in linki:
            linki.append(response.url)

            for de in response.xpath('//*[@id="sidebar-categories"]/div/nav/ul/li'):

                la = Allegro3Item()
                link = de.xpath('a/@href').extract()
                la['name'] = de.xpath('a/span/span/text()').extract()[0].encode('utf-8')
                la['count'] = de.xpath('span/text()').extract()[0].encode('utf-8')
                la['url'] = response.urljoin(link[0]).encode('utf-8')
                la['subcategory']=[]


                if la['url'] is not None:
                    if la['url'] not in linki:
                        linki.append(la['url'])

                        request = scrapy.Request(la['url'],callback=self.SearchFurther) 
                        #la['subcategory'].append(request.meta['la2'])
                yield la        

    def SearchFurther(self,response):
        global linki

        for de in response.xpath('//*[@id="sidebar-categories"]/div/nav/ul/li'):

            link = de.xpath('a/@href').extract()
            la2 = Allegro3Item()
            la2['name'] = de.xpath('a/span/span/text()').extract()[0].encode('utf-8')
            la2['count'] = de.xpath('span/text()').extract()[0].encode('utf-8')
            la2['url'] = response.urljoin(link[0]).encode('utf-8')

            yield la2

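In this code I am trying to create a class/item with:

1. brand name
2. number of offers
3. link to the subcategory
4. a list of subcategory elements with the same data as in points 1-4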

I have a problem with point 4, when I create the additional request to 'SearchFurther':

request = scrapy.Request(la['url'],callback=self.SearchFurther)

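I don't know how to pass the la2 item, the result of SearchFurther, back to the previous request so that I can append la2 to la['subcategory'] as an additional list element (one brand can have many subcategories).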

I would be grateful for any help.

Answer 1

Please take a look at this documentation: http://doc.scrapy.org/en/latest/topics/request-response.html#passing-additional-data-to-callback-functions

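In some cases you may be interested in passing arguments to those callback functions so that you can receive them later, in the second callback. You can use the Request.meta attribute for that.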
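To make the idea concrete, here is a rough sketch of the pattern: the first callback builds the brand item and attaches it to the follow-up request through meta, and the second callback picks it up, appends the subcategories, and only then yields it. So that it runs without a network connection, the sketch does not import Scrapy. The Request and Response classes, the PAGES dictionary, the crawl helper, and the Acura URL are all stand-ins invented for illustration. In a real spider the corresponding pieces would be scrapy.Request(url, callback=..., meta={...}) and response.meta. It also handles only one level of subcategories.

```python
from dataclasses import dataclass, field
from typing import Callable

# In-memory stand-in for the site: URL -> sidebar entries (name, count, url).
# The BMW figures come from the question; the Acura URL is a made-up placeholder.
PAGES = {
    "start": [
        ("BMW", 12726, "http://allegro.pl/osobowe-bmw-4032"),
        ("Acura", 2, "http://example.invalid/acura"),
    ],
    "http://allegro.pl/osobowe-bmw-4032": [
        ("Seria 1", 832, "http://allegro.pl/bmw-seria-1-12435"),
    ],
    "http://example.invalid/acura": [],  # brand without subcategories
}


@dataclass
class Request:
    """Stand-in for scrapy.Request: only url, callback and meta are modelled."""
    url: str
    callback: Callable
    meta: dict = field(default_factory=dict)


@dataclass
class Response:
    """Stand-in for a Scrapy response; meta is carried over from the request."""
    url: str
    meta: dict


def parse(response):
    # First callback: build one item per brand, but do not yield it yet.
    # It travels to the next callback inside the request's meta dict.
    for name, count, url in PAGES[response.url]:
        item = {"name": name, "count": count, "url": url, "subcategories": []}
        yield Request(url, callback=parse_subcategories, meta={"item": item})


def parse_subcategories(response):
    # Second callback: pick the parent item back up, fill it in, yield it once.
    item = response.meta["item"]
    for name, count, url in PAGES[response.url]:
        item["subcategories"].append({"name": name, "count": count, "url": url})
    yield item


def crawl(start_url):
    """Tiny scheduler: runs callbacks and follows the requests they yield."""
    items = []
    queue = [Request(start_url, callback=parse)]
    while queue:
        request = queue.pop(0)
        response = Response(request.url, request.meta)
        for result in request.callback(response):
            if isinstance(result, Request):
                queue.append(result)
            else:
                items.append(result)
    return items


if __name__ == "__main__":
    import json
    print(json.dumps(crawl("start"), indent=2))
```

The key point is that the brand item is yielded only from the last callback, so it already contains its subcategories when it is exported. In Scrapy 1.7 and later, the cb_kwargs argument of Request is the preferred way to pass such data to a callback, while meta still works.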
