Scrapy: deny localized URLs

Time: 2017-11-11 14:18:04

Tags: python regex localization scrapy

I am trying to deny localized URLs like this:

rules = (
    Rule(LinkExtractor(deny=(r'\/es\/')), follow = True)
)

However, this fails. I also tried the following regular expression, with no luck.

rules = (
    Rule(LinkExtractor(deny=(r'\/es\/*.*')), follow = True)
)

Basically, I am only interested in the English version of the resource, not the Spanish version, i.e. the one that has /es/ in its URL.

How can I make sure I do not crawl the Spanish URLs?

1 answer:

Answer 0: (score: 0)

Define your middleware in your spider like this:

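import scrapy
from scrapy import Request


class MySpider(scrapy.Spider):
    name = "my_spider"

    # Enable the custom downloader middleware for this spider only.
    custom_settings = {
        'DOWNLOADER_MIDDLEWARES': {
            'project_root_path.MyMiddlewaresFile.MyMiddleware': 300,
        }
    }

    def start_requests(self):
        # The request arguments (url, callback) were left out of the original answer.
        yield Request()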

And in MyMiddlewaresFile.py:

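from scrapy.exceptions import IgnoreRequest


class MyMiddleware(object):

    def process_request(self, request, spider):
        if "/es/" in request.url:
            # Drop Spanish-language URLs before they are downloaded.
            raise IgnoreRequest("Skipping localized URL: %s" % request.url)

        # Returning None tells Scrapy to keep processing the request.
        return None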
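
A plain substring test on request.url, like the one in the middleware above, also matches when the locale marker appears in a query string or fragment. Here is a minimal sketch of a stricter check, assuming the locale is always a whole path segment. The helper name is_localized, its locales parameter and the example.com URLs are hypothetical and not part of the original answer:

```python
from urllib.parse import urlsplit


def is_localized(url, locales=("es",)):
    """Return True if any path segment of `url` equals one of `locales`.

    Hypothetical helper: it only inspects the path, so "/es/" inside a
    query string or fragment does not count.
    """
    segments = urlsplit(url).path.split("/")
    return any(segment in locales for segment in segments)


# Made-up URLs for illustration only.
print(is_localized("https://example.com/es/docs/intro"))        # True
print(is_localized("https://example.com/docs/intro"))           # False
print(is_localized("https://example.com/docs?next=/es/intro"))  # False
```

Inside process_request, the condition could then be written as is_localized(request.url).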

See the documentation: https://doc.scrapy.org/en/latest/topics/downloader-middleware.html#scrapy.downloadermiddlewares.DownloaderMiddleware.process_request
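
The two patterns from the question can also be checked outside Scrapy with the standard re module. This is only a sketch: it assumes the link extractor applies deny patterns with search semantics (a match anywhere in the URL), and the example.com URLs are made up. It also shows a pitfall worth ruling out: parentheses without a trailing comma do not create a tuple, so rules = (Rule(...)) would be a single Rule rather than a tuple of rules.

```python
import re

# Deny patterns copied from the question.
patterns = [r'\/es\/', r'\/es\/*.*']

# Hypothetical URLs for illustration only.
spanish_url = "https://example.com/es/docs/intro"
english_url = "https://example.com/docs/intro"

for pattern in patterns:
    regex = re.compile(pattern)
    # Report whether the pattern is found in the Spanish and the English URL.
    print(pattern, bool(regex.search(spanish_url)), bool(regex.search(english_url)))

# Parentheses alone do not make a tuple; the trailing comma does.
not_a_tuple = ("rule")
one_item_tuple = ("rule",)
print(type(not_a_tuple).__name__, type(one_item_tuple).__name__)
```

If both patterns match the Spanish URL here, the regular expression itself is probably not the cause, and the shape of rules is the next thing to check.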