Question

I have noticed that a CrawlSpider with LinkExtractor rules only parses the linked pages, not the start page itself.

For example, if http://mypage.test contains links to http://mypage.test/cats/ and http://mypage.test/horses/, the crawler parses the cats and horses pages but not http://mypage.test itself. Here is a simple code example:

from scrapy.crawler import CrawlerProcess
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class MySpider(CrawlSpider):
    name = 'myspider'
    start_urls = ['http://mypage.test']

    rules = [
        Rule(LinkExtractor(), callback='parse_page', follow=True),
    ]

    def parse_page(self, response):
        yield {
            'url': response.url,
            'status': response.status,
        }


process = CrawlerProcess({
    'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)',
    'ITEM_PIPELINES': {
        'pipelines.MyPipeline': 100,
    },
})
process.crawl(MySpider)
process.start()

My goal is to parse every page of the website, including the start page, by following links. How can I do that?


Answer 1

Remove start_urls and add:

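def start_requests(self):  # requires: from scrapy import Request
    # Callbacks must be callables here; strings are only accepted in Rule.
    yield Request('http://mypage.test', callback=self.parse_page)
    # No callback: CrawlSpider's default applies the rules. dont_filter keeps
    # the duplicate filter from dropping this second request to the same URL.
    yield Request('http://mypage.test', dont_filter=True)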

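CrawlSpider uses its default callback (self.parse in the Scrapy version this answer was written for) to extract and follow links, so the start page needs one request for parse_page and another for link extraction.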
