Crawl an entire website, except links under a specific path

Date: 2016-02-19 20:39:54

Tags: scrapy scrapy-spider

I have a Scrapy spider:

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

class ExampleSpider(CrawlSpider):
    name = "spidermaster"
    allowed_domains = ["www.test.com"]
    start_urls = ["http://www.test.com/"]
    rules = [Rule(SgmlLinkExtractor(allow=()),
                  follow=True),
             Rule(SgmlLinkExtractor(allow=()), callback='parse_item'),
    ]
    def parse_item(self,response):
        self.log('A response from %s just arrived!' % response.url)

I am trying to crawl the whole website, except for the content under a specific path.

For example, I want to crawl all of the test site except www.test.com/too_much_links.

Thanks in advance

1 Answer:

Answer 0 (score: 0):

I usually do it this way:

ignore = ['too_much_links', 'many_links']

rules = [Rule(SgmlLinkExtractor(allow=(), deny=ignore), follow=True),
         Rule(SgmlLinkExtractor(allow=(), deny=ignore), callback='parse_item'),
]