Unable to scrape pagination links with Scrapy

Asked: 2016-07-14 11:44:52

Tags: python pagination scrapy

I'm trying to learn Scrapy by scraping the titles of listings on a paginated property website. I can't get the listings from the 'Next' pages matched by the rule defined in the rules list.

Code:

from scrapy import Spider
from scrapy.selector import Selector 
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import Rule, CrawlSpider
from property.items import PropertyItem
import re

class VivastreetSpider(CrawlSpider):
    name = 'viva'
    allowed_domains = ['http://chennai.vivastreet.co.in/']
    start_urls = ['http://chennai.vivastreet.co.in/rent+chennai/']
    rules = [
        Rule(LinkExtractor(restrict_xpaths = ('//*[text()[contains(., "Next")]]')), callback = 'parse_item', follow = True)
        ]

    def parse_item(self, response):
        a = Selector(response).xpath('//a[contains(@id, "vs-detail-link")]/text()').extract()
        i = 1
        for b in a:
            print('testtttttttttttttt ' + str(i) + '\n' + str(b))
            i += 1
        item = PropertyItem()
        item['title'] = a[0]
        yield item

Edit - I replaced the parse method with parse_item, and now it doesn't scrape anything at all.

Ignore the item object code at the end; I plan to replace it with a Request callback to another method that scrapes more details from each listing's URL.

I'll post the log if needed.

Edit #2 - I now extract the URLs from the paginated pages and issue Requests to another method, which finally scrapes the details from each listing's page. The parse_start_url() method runs, but the parse_item() method is never called.

Code:

from scrapy import Request
from scrapy.selector import Selector 
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import Rule, CrawlSpider
from property.items import PropertyItem
import sys

reload(sys)
sys.setdefaultencoding('utf8')  #To prevent UnicodeDecodeError, UnicodeEncodeError.

class VivastreetSpider(CrawlSpider):
    name = 'viva'
    allowed_domains = ['chennai.vivastreet.co.in']
    start_urls = ['http://chennai.vivastreet.co.in/rent+chennai/']
    rules = [
        Rule(LinkExtractor(restrict_xpaths = '//*[text()[contains(., "Next")]]'), callback = 'parse_start_url', follow = True)
        ]   

    def parse_start_url(self, response):
        urls = Selector(response).xpath('//a[contains(@id, "vs-detail-link")][@href]').extract()    
        print('test0000000000000000000' + str(urls[0]))
        for url in urls:
            yield Request(url = url, callback = self.parse_item)

    def parse_item(self, response):
        #item = PropertyItem()
        a = Selector(response).xpath('//*h1[@class = "kiwii-font-xlarge kiwii-margin-none"').extract()
        print('test tttttttttttttttttt ' + str(a))
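
A likely reason parse_item() never runs: the XPath above extracts whole <a> elements serialized as HTML rather than their href values, so the yielded Requests never get valid URLs, and the h1 expression in parse_item is not valid XPath. A minimal sketch of both methods with those two points fixed (same selectors assumed, class names not verified against the site):

    def parse_start_url(self, response):
        # Extract the href attribute values, not the serialized <a> elements.
        hrefs = response.xpath('//a[contains(@id, "vs-detail-link")]/@href').extract()
        for href in hrefs:
            # urljoin resolves relative links against the current page URL.
            yield Request(response.urljoin(href), callback=self.parse_item)

    def parse_item(self, response):
        # Valid form of the h1 selector; the class names are assumed, not verified.
        title = response.xpath('//h1[@class="kiwii-font-xlarge kiwii-margin-none"]/text()').extract_first()
        print('test ' + str(title))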

1 Answer:

Answer 0 (score: 0)

There are a few issues with your spider:

  1. Your allowed_domains is broken: it should contain only the domain name, not a full URL with a scheme. If you check your spider's log, you'll probably see a lot of filtered offsite requests.

  2. You're misunderstanding CrawlSpider here. When a CrawlSpider starts, it first downloads every URL in start_urls and calls parse_start_url on each response; the rules are then applied to extract further links, and those responses go to the rule's callback.

  3. So your spider should look something like this:

    from scrapy.linkextractors import LinkExtractor
    from scrapy.selector import Selector
    from scrapy.spiders import CrawlSpider, Rule

    class VivastreetSpider(CrawlSpider):
        name = 'test'
        allowed_domains = ['chennai.vivastreet.co.in']
        start_urls = ['http://chennai.vivastreet.co.in/rent+chennai/']
        rules = [
            Rule(
                LinkExtractor(restrict_xpaths='//*[text()[contains(., "Next")]]'),
                callback='parse_start_url',
                follow=True,  # keep following subsequent "Next" links
            )
        ]

        def parse_start_url(self, response):
            a = Selector(response).xpath('//a[contains(@id, "vs-detail-link")]/text()').extract()
            return {'test': len(a)}
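
If you want the per-listing details from your Edit #2 rather than just a count, you can replace the counting parse_start_url above with one that queues a Request for each listing's detail page and hands it to a parse_item callback. This is only a sketch: the Request and PropertyItem imports, the @href extraction and the h1 class names are all taken from your snippets as assumptions and should be checked against the actual markup.

        def parse_start_url(self, response):
            # Queue each listing's detail page; urljoin resolves relative hrefs.
            for href in response.xpath('//a[contains(@id, "vs-detail-link")]/@href').extract():
                yield Request(response.urljoin(href), callback=self.parse_item)

        def parse_item(self, response):
            # Class names come from the question's snippet; verify them in the page source.
            item = PropertyItem()
            item['title'] = response.xpath(
                '//h1[@class="kiwii-font-xlarge kiwii-margin-none"]/text()').extract_first()
            yield item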