Scrapy: crawl only part of a website

Time: 2014-07-17 13:49:10

Tags: hyperlink scrapy web-crawler

Hello, I have the following code to scan all the links on a given website.

from scrapy.item import Field, Item
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors import LinkExtractor


class SampleItem(Item):
    link = Field()


class SampleSpider(CrawlSpider):
    name = "sample_spider"
    allowed_domains = ["domain.com"]
    start_urls = ["http://domain.com"]

    rules = (
        Rule(LinkExtractor(), callback='parse_page', follow=True),
    )

    def parse_page(self, response):
        item = SampleItem()
        item['link'] = response.url
        return item

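How can I crawl only part of a global site? For example, I want to scan only the French section of an international site whose URL structure is domain.com/fr/fr. So I tried: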

from scrapy.item import Field, Item
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors import LinkExtractor


class SampleItem(Item):
    link = Field()


class SampleSpider(CrawlSpider):
    name = "sample_spider"
    allowed_domains = ["domain.com/fr/fr"]
    start_urls = ["http://domain.com/fr/fr"]

    rules = (
        Rule(LinkExtractor(), callback='parse_page', follow=True),
    )

    def parse_page(self, response):
        item = SampleItem()
        item['link'] = response.url
        return item

But the spider returns only 3 results instead of thousands. What am I doing wrong?

1 Answer:

Answer 0 (score: 1)

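To crawl only part of a site, restrict which links are followed with LinkExtractor (its allow argument takes a regular expression) rather than with allowed_domains. As far as I know, allowed_domains is meant to hold bare domain names and is compared against the host of each URL, so an entry with a path such as domain.com/fr/fr will most likely cause nearly every link to be filtered out as offsite. You can generate a template spider with scrapy genspider -t crawl domain domain.com and adapt it: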

# -*- coding: utf-8 -*-
import scrapy
from scrapy.contrib.linkextractors import LinkExtractor
from scrapy.contrib.spiders import CrawlSpider, Rule

from test.items import testItem


class DomainSpider(CrawlSpider):
    name = 'domain'
    allowed_domains = ['domain.com']
    start_urls = ['http://www.domain.com/fr/fr']

    rules = (
        Rule(LinkExtractor(allow=r'fr/'), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        i = testItem()
        #i['domain_id'] = response.xpath('//input[@id="sid"]/@value').extract()
        #i['name'] = response.xpath('//div[@id="name"]').extract()
        #i['description'] = response.xpath('//div[@id="description"]').extract()
        return i
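
To see why the two settings behave so differently, here is a minimal sketch in plain Python (no Scrapy required). host_allowed is a simplified stand-in I wrote for the offsite check, not Scrapy's actual implementation, and the three URLs and the stricter pattern are assumptions about the site layout. link_allowed mirrors the way the allow pattern is applied, as a regex search on the absolute URL.

```python
import re
from urllib.parse import urlparse


def host_allowed(url, allowed_domains):
    """Simplified stand-in for the offsite check: only the hostname is compared."""
    host = urlparse(url).hostname or ""
    return any(host == d or host.endswith("." + d) for d in allowed_domains)


def link_allowed(url, allow_pattern):
    """Mimics LinkExtractor's `allow`: a regex search on the absolute URL."""
    return re.search(allow_pattern, url) is not None


# Hypothetical URLs, for illustration only.
urls = [
    "http://www.domain.com/fr/fr/produits",
    "http://www.domain.com/en/us/products",
    "http://www.domain.com/en/us/fr/promo",
]

# An entry containing a path can never equal a hostname, so every URL is offsite.
with_path = [host_allowed(u, ["domain.com/fr/fr"]) for u in urls]

# A bare domain accepts the whole site; the path filter has to come from `allow`.
bare_domain = [host_allowed(u, ["domain.com"]) for u in urls]

# r'fr/' matches anywhere in the URL, so it also accepts the third URL.
loose = [link_allowed(u, r"fr/") for u in urls]

# Anchoring on the section prefix is stricter (assumed pattern, adjust to the site).
strict = [link_allowed(u, r"^https?://[^/]+/fr/fr(/|$)") for u in urls]

print(with_path)    # [False, False, False]
print(bare_domain)  # [True, True, True]
print(loose)        # [True, False, True]
print(strict)       # [True, False, False]
```

If allow=r'fr/' turns out to pull in pages outside the French section, a more tightly anchored pattern like the one above may be worth trying in the Rule.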