Question

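I'm relatively new to Scrapy and am getting a lot of exceptions... Here is what I'm trying to do:

There are 4 nested links I want to get data from. Let's say I have 5 items I want to scrape. These items are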

import scrapy


class PschamberItem(scrapy.Item):
    Industry = scrapy.Field()
    Company = scrapy.Field()
    Contact_First_name = scrapy.Field()
    Contact_Last_name = scrapy.Field()
    Website = scrapy.Field()

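To start scraping, I first have to get the Industry. The Industry xpath also contains the links to the individual listings of companies that belong to each industry segment.
Next I want to use the Industry xpath and follow the link. This page does not contain any data I want to scrape. However, it contains href links to the individual companies, each of which has its own basic information page.
Using the href link from the listing page, I now arrive at a page that contains the information for one company. Here I want to scrape the Company, Address and Website. I need to click another href link that leads to Contact_First_Name and Contact_Last_Name.
Using that href link, I now arrive at another page that contains the Contact_First_Name and Contact_Last_Name.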
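
Each of the steps above follows an href taken from the previous page. Such hrefs are often relative, so they may need to be resolved against the URL of the page they were found on before being requested. The sketch below shows that resolution with the standard library only; the href values and the `absolute_links` helper are hypothetical and chosen for illustration, not taken from the real site. Inside a Scrapy callback, `response.urljoin(href)` serves the same purpose.

```python
from urllib.parse import urljoin


def absolute_links(page_url, hrefs):
    """Resolve each href against the URL of the page it was found on."""
    return [urljoin(page_url, href) for href in hrefs]


# Hypothetical hrefs: one root-relative, one page-relative, one already absolute.
links = absolute_links(
    "http://cm.pschamber.com/list/",
    [
        "/list/ql/finance-10",
        "member/example-1",
        "http://cm.pschamber.com/list/member/example-2",
    ],
)

for link in links:
    print(link)
```

Already absolute URLs pass through unchanged, so it is safe to apply the join to every extracted href.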

After crawling all of these pages, I should have items that look somewhat like this:

Industry    Company  Website   Contact_First_Name  Contact_Last_Name
Finance     JPMC     JP.com    Jamie               Dimon
Finance     BOA      BOA.com   Bryan               Moynihan
Technology  ADSK     ADSK.com  Carl                Bass

EDITED

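Below is the code that is now working. Anzel's suggestions did help, but I realized that the allowed_domains value in my spider subclass was wrong, which prevented the nested links from being followed. Once I changed it, it worked.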

import scrapy


class PschamberSpider(scrapy.Spider):
    name="pschamber"
    allowed_domains = ["cm.pschamber.com"]
    start_urls = ["http://cm.pschamber.com/list/"]


    def parse(self, response):
        for sel in response.xpath('//*[@id="mn-ql"]/ul/li/a'):
            # create a fresh item for each industry link, so that the
            # requests yielded below do not all share one mutable object
            item = PschamberItem()
            # xpath and xpath().extract() will return a list
            # extract()[0] will return the first element in the list
            item['Industry'] = sel.xpath('text()').extract()
            # another mistake you made here
            # you're trying to call scrapy.Request(LIST of hrefs) which will fail
            # scrapy.Request only takes a url string, not list
            # another big mistake is you're trying to yield the item,
            # whereas you should yield the Request object
            yield scrapy.Request(sel.xpath('@href').extract()[0], callback=self.parse_2, meta={'item': item})

    # another mistake, your callback function DOESNT take item as argument
    def parse_2(self, response):
        # iterate over the selectors themselves (no .extract() here),
        # otherwise sel is a plain string and sel.xpath() fails
        for sel in response.xpath('.//*[@id="mn-members"]/div/div/div/div/div/a'):
            # you can access your response meta like this;
            # copy it so that each company request carries its own item
            item = response.meta['item'].copy()
            item['Company'] = sel.xpath('text()').extract()
            # again, yield the Request object
            yield scrapy.Request(sel.xpath('@href').extract()[0], callback=self.parse_3, meta={'item': item})


    def parse_3(self, response):
        item=response.meta['item']
        # the node test (*) is required before the predicate, './/[@id=...]' is not valid XPath
        item['Website'] = response.xpath('.//*[@id="mn-memberinfo-block-website"]/a/@href').extract()
        # OK, finally assume you're done, just return the item object
        return item
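
For readers wondering why the allowed_domains value mattered: Scrapy drops requests whose host does not match one of the listed domains. The sketch below is a simplified approximation of that check written with the standard library, not Scrapy's actual implementation. The incorrect value originally used is not shown in this post, so an entry that includes the scheme is used here purely as an assumed example of a value that would never match; `is_onsite` is a hypothetical helper name.

```python
from urllib.parse import urlparse


def is_onsite(url, allowed_domains):
    """Return True if the URL's host equals an allowed domain or is a subdomain of it."""
    host = (urlparse(url).hostname or "").lower()
    return any(
        host == domain.lower() or host.endswith("." + domain.lower())
        for domain in allowed_domains
    )


url = "http://cm.pschamber.com/list/"

# A bare host name, as used in the working spider.
good = is_onsite(url, ["cm.pschamber.com"])

# Assumed example of a bad entry: the scheme is included, so no host can match it.
bad = is_onsite(url, ["http://cm.pschamber.com"])

print(good, bad)
```

Under this approximation a parent domain such as "pschamber.com" would also match, since the host is one of its subdomains.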

Answer 1

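You've made quite a few mistakes in your code, which is why it doesn't run as expected. See my brief example below for how to get the items you need and pass meta on to the other callbacks. I haven't copied your xpaths; I simply took the most straightforward ones from the website, and you can substitute your own.

I'll comment as explicitly as I can to show you where you went wrong.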

import scrapy


class PschamberSpider(scrapy.Spider):
    name = "pschamber"
    # start from this, since your domain is a sub-domain on its own,
    # you need to change to this without http://
    allowed_domains = ["cm.pschamber.com"]
    start_urls = (
        'http://cm.pschamber.com/list/',
    )

    def parse(self, response):
        for sel in response.xpath('//div[@id="mn-ql"]//a'):
            # create a new item for each industry link; a single item
            # reused across iterations would be overwritten each time
            item = PschamberItem()
            # xpath and xpath().extract() will return a list
            # extract()[0] will return the first element in the list
            # (the key must match a field declared on PschamberItem)
            item['Industry'] = sel.xpath('text()').extract()[0]

            # another mistake you made here
            # you're trying to call scrapy.Request(LIST of hrefs) which will fail
            # scrapy.Request only takes a url string, not list
            # another big mistake is you're trying to yield the item,
            # whereas you should yield the Request object
            yield scrapy.Request(
                sel.xpath('@href').extract()[0],
                callback=self.parse_2,
                meta={'item': item}
            )

    # another mistake, your callback function DOESNT take item as argument
    def parse_2(self, response):
        for sel in response.xpath('//div[@class="mn-title"]//a'):
            # you can access your response meta like this;
            # copy it so that each company request carries its own item
            item = response.meta['item'].copy()
            item['Company'] = sel.xpath('text()').extract()[0]
            # again, yield the Request object
            yield scrapy.Request(
                sel.xpath('@href').extract()[0],
                callback=self.parse_3,
                meta={'item': item}
            )

    def parse_3(self, response):
        item = response.meta['item']
        item['Website'] = response.xpath('//a[@class="mn-print-url"]/text()').extract()
        # OK, finally assume you're done, just return the item object
        return item
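
One thing to watch when passing an item through meta: callbacks run later, after the loop that scheduled the requests has finished. If a single item object is reused across iterations, every pending request ends up pointing at the same object holding the last value written. The sketch below shows the effect with plain dicts standing in for Scrapy items and requests; the function names and sample values are hypothetical and only for illustration.

```python
def schedule_shared(companies):
    """Attach the SAME dict to every pending request."""
    item = {"Industry": "Finance"}
    pending = []
    for name in companies:
        item["Company"] = name  # overwrites the value set in the previous iteration
        pending.append({"meta": {"item": item}})
    # Read the values back only after the loop, as a later callback would.
    return [request["meta"]["item"]["Company"] for request in pending]


def schedule_copied(companies):
    """Give every pending request its own copy of the item."""
    base = {"Industry": "Finance"}
    pending = []
    for name in companies:
        item = dict(base)  # shallow copy, one per request
        item["Company"] = name
        pending.append({"meta": {"item": item}})
    return [request["meta"]["item"]["Company"] for request in pending]


shared = schedule_shared(["JPMC", "BOA"])
copied = schedule_copied(["JPMC", "BOA"])
print(shared)
print(copied)
```

With Scrapy items the equivalent of `dict(base)` is `item.copy()`, or simply creating a new item inside the loop.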

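Hopefully this is self-explanatory. To understand the basics of Scrapy you should read the Scrapy documentation thoroughly, and you will soon learn another approach: setting rules to follow links that match certain patterns... Of course, once you have the basics right, you'll understand those too.

Although everyone's journey is different, I strongly recommend that you keep reading and practicing until you are confident in what you're doing before crawling real websites. Also, there are rules that govern which web content may be scraped, as well as copyright on the content you scrape.

Keep this in mind, or you may get into big trouble in the future. Anyway, good luck, and I hope this answer helps you solve the problem!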
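
One common form such rules take is a site's robots.txt file. As a minimal sketch, assuming that is the kind of rule meant here, the standard library can check whether a given URL may be fetched. The robots.txt content, user agent name and URLs below are hypothetical and are not the real site's policy. Scrapy also has a `ROBOTSTXT_OBEY` setting that makes it perform this kind of check itself.

```python
from urllib.robotparser import RobotFileParser


def can_crawl(robots_txt, user_agent, url):
    """Check a URL against robots.txt rules supplied as a string."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical robots.txt content, for illustration only.
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
"""

allowed = can_crawl(SAMPLE_ROBOTS, "mybot", "http://example.com/list/")
blocked = can_crawl(SAMPLE_ROBOTS, "mybot", "http://example.com/private/page")
print(allowed, blocked)
```

A robots.txt check covers only crawler access rules; copyright and a site's terms of use are separate questions.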

Scrapy - Visiting nested links and grabbing metadata from each level
