Scrapy: re.match not working when finding URLs in a string with a regular expression

Date: 2015-04-30 07:06:38

Tags: regex scrapy scrapy-spider

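I am trying to crawl multiple URLs in the same domain. I extract the list of URLs from the page and join them into a single string. I want to search that string with a regular expression to find the URLs, but re.match() always returns None. I have tested my regular expression and it works correctly. Here is my code: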

# -*- coding: UTF-8 -*-

import scrapy
import codecs 
import re

from scrapy.contrib.spiders import CrawlSpider, Rule

from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

from scrapy import Request

from scrapy.selector import HtmlXPathSelector

from hurriyet.items import HurriyetItem

class hurriyet_spider(CrawlSpider):
    name = 'hurriyet'
    allowed_domains = ['hurriyet.com.tr']
    start_urls = ['http://www.hurriyet.com.tr/gundem/']

    rules = (Rule(SgmlLinkExtractor(allow=('\/gundem(\/\S*)?.asp$')),'parse',follow=True),) 

    def parse(self, response):
        image = HurriyetItem()
        text =  response.xpath("//a/@href").extract()
        print text

        urls = ''.join(text)


        page_links = re.match("(?i)\b((?:[a-z][\w-]+:(?:/{1,3}|[a-z0-9%])|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:\'\".,<>?«»“”‘’]))", urls, re.M)

        image['title'] = response.xpath("//h1[@class = 'title selectionShareable'] | //h1[@itemprop = 'name']/text()").extract()
        image['body'] = response.xpath("//div[@class = 'detailSpot']").extract()
        image['body2'] = response.xpath("//div[@class = 'ctx_content'] ").extract()
        print page_links

        return image, text

1 answer:

Answer 0 (score: 0)

There is no need to use the re module here: Scrapy selectors have a built-in feature for regex filtering.

def parse(self, response):
        ...
        page_links = response.xpath("//a/@href").re('your_regex_expression')
        ...

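That said, I suggest you first try this approach in the Scrapy shell to make sure your regular expression really works. I would not expect anyone to try to debug a mile-long regex: it is basically a write-only language :)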
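
As for why re.match() returned None in the question's code, two things are likely at play, and both can be checked with the standard library alone. First, re.match() only tries to match at the very start of the string (even with re.M), so it fails whenever the joined string does not begin with a URL. Second, the pattern is written as a normal string, so "\b" is a backspace character rather than a word boundary; a raw string (r"...") avoids that. The following is a minimal sketch, not the asker's actual data: the href values and the shortened pattern URL_RE are assumptions made up for illustration.

```python
import re

# Hypothetical href values, standing in for
# response.xpath("//a/@href").extract() in the spider.
hrefs = [
    "/gundem/",
    "http://www.hurriyet.com.tr/gundem/example.asp",
    "javascript:void(0)",
    "http://www.hurriyet.com.tr/index/?d=20150430",
]

# Simplified stand-in pattern (an assumption, not the question's full regex).
# The r"" prefix matters: in a plain string "\b" is a backspace character,
# while in a raw string it is the regex word-boundary assertion.
URL_RE = re.compile(r"\bhttps?://[^\s\"'<>]+", re.IGNORECASE)

# Join with a separator so that neighbouring URLs do not run together,
# which is what ''.join(...) would cause.
joined = "\n".join(hrefs)

# re.match only tries position 0, which holds "/gundem/" here.
first = URL_RE.match(joined)
print(first)  # None

# re.findall scans the whole string and returns every match.
found = URL_RE.findall(joined)
print(found)

# The non-raw "\b" looks for a literal backspace, so it never matches here.
backspace_hit = re.search("\bhttp", "http://www.hurriyet.com.tr/")
raw_hit = re.search(r"\bhttp", "http://www.hurriyet.com.tr/")
print(backspace_hit, raw_hit is not None)
```

With a raw-string pattern, the same regex can be passed to the selector's .re() method shown above, which applies it to each extracted href and returns the matches as a list.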