Question

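My problem is this: I have a list on the main page (html - li), and for each entry in that list I need to visit another page, collect some information there, put it into an item, and combine it with the other data taken from the same list entry on the main page (html - li). I have written a first version of the code, but I am new to Python and Scrapy and have run into some difficulties.

I arrived at this solution, but it generates two items for each element of the main list.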

class BoxSpider(scrapy.Spider):
    name = "mag"
    start_urls = [
        "http://www.example.com/index.html"
    ]

    def secondPage(self, response):
        secondPageItem = CinemasItem()
        secondPageItem['trailer'] = 'trailer'
        secondPageItem['synopsis'] = 'synopsis'
        yield secondPageItem

    def parse(self, response):

        for sel in response.xpath('//*[@id="conteudoInternas"]/ul/li'):

            item = CinemasItem()
            item['title'] = 'title'
            item['room'] = 'room'
            item['mclass'] = 'mclass'
            item['minAge'] = 'minAge'
            item['cover'] = 'cover'
            item['sessions'] = 'sessions'

            secondUrl = sel.xpath('p[1]/a/@href').extract()[0]

            yield item
            yield scrapy.Request(url=secondUrl, callback=self.secondPage)

Can someone help me generate a single item with the 'title', 'room', 'mclass', 'minAge', 'cover', 'sessions', 'trailer' and 'synopsis' fields all filled in, instead of one item with 'title', 'room', 'mclass', 'minAge', 'cover' and 'sessions' filled in and a second item with only 'trailer' and 'synopsis' filled in?

Answer 1

You need to pass the item instantiated in parse() to the secondPage callback:

    def parse(self, response):

        for sel in response.xpath('//*[@id="conteudoInternas"]/ul/li'):

            item = CinemasItem()
            item['title'] = 'title'
            item['room'] = 'room'
            item['mclass'] = 'mclass'
            item['minAge'] = 'minAge'
            item['cover'] = 'cover'
            item['sessions'] = 'sessions'

            secondUrl = sel.xpath('p[1]/a/@href').extract()[0]

            # see: we are passing the item inside the meta
            yield scrapy.Request(url=secondUrl, meta={'item': item}, callback=self.secondPage)

    def secondPage(self, response):
        # see: we are getting the item from meta
        item = response.meta['item']
        item['trailer'] = 'trailer'
        item['synopsis'] = 'synopsis'
        yield item
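To see why handing the item over through meta leaves exactly one item per list entry, here is a rough toy model of the request/callback chain that runs without Scrapy installed. FakeRequest, FakeResponse and run are hypothetical stand-ins invented for this sketch, not Scrapy classes, and the URLs and field values are placeholders; it only illustrates the flow, not how Scrapy schedules requests internally.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class FakeRequest:
    """Hypothetical stand-in for a request: URL, callback and a meta dict."""
    url: str
    callback: Callable
    meta: dict = field(default_factory=dict)


@dataclass
class FakeResponse:
    """Hypothetical stand-in for a response; it carries the request's meta."""
    url: str
    meta: dict


def parse(response):
    # One partially filled item per list entry on the main page.
    for n in (1, 2):
        item = {"title": f"title {n}", "room": f"room {n}"}
        # The item is NOT yielded here; it travels with the follow-up request.
        yield FakeRequest(
            url=f"http://www.example.com/movie{n}.html",  # placeholder URL
            callback=second_page,
            meta={"item": item},
        )


def second_page(response):
    # Pick up the same item and complete it with the detail-page fields.
    item = response.meta["item"]
    item["trailer"] = f"trailer from {response.url}"
    item["synopsis"] = f"synopsis from {response.url}"
    yield item


def run(start_url):
    """Toy loop: follow every yielded request and collect every yielded item."""
    items = []
    pending = [FakeRequest(url=start_url, callback=parse)]
    while pending:
        request = pending.pop(0)
        response = FakeResponse(url=request.url, meta=request.meta)
        for result in request.callback(response):
            if isinstance(result, FakeRequest):
                pending.append(result)
            else:
                items.append(result)
    return items


items = run("http://www.example.com/index.html")
for item in items:
    print(item)
```

Because parse only yields requests and second_page is the only place an item is yielded, each list entry ends up as a single item that holds both the main-page fields and the detail-page fields.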

See also:

meta

Recursively crawling web pages
