Scrapy CrawlSpider not following links

Asked: 2015-11-29 10:14:48

Tags: python scrapy scrapy-spider

I am trying to use Scrapy to crawl pages that use a Next button to move to new pages. I use an instance of CrawlSpider and define a LinkExtractor to extract the new pages to follow. However, the spider only fetches the start URL and stops there. I have included the spider code and the log below. Does anyone know why the spider is unable to crawl the pages?

        from scrapy.spiders import CrawlSpider, Rule
        from scrapy.linkextractors import LinkExtractor
        from realcommercial.items import RealcommercialItem
        from scrapy.selector import Selector
        from scrapy.http import Request

        class RealCommercial(CrawlSpider):
            name = "realcommercial"
            allowed_domains = ["realcommercial.com.au"]
            start_urls = [
                "http://www.realcommercial.com.au/for-sale/in-vic/list-1?nearbySuburb=false&autoSuggest=false&activeSort=list-date"
            ]
            rules = [Rule(LinkExtractor(allow=['/for-sale/in-vic/list-\d+?activeSort=list-date']),
                          callback='parse_response',
                          process_links='process_links',
                          follow=True),
                     Rule(LinkExtractor(allow=[]),
                          callback='parse_response',
                          process_links='process_links',
                          follow=True)]


            def parse_response(self, response):        
                sel = Selector(response)
                sites = sel.xpath("//a[@class='details']")
                #items = []
                for site in sites:
                    item = RealcommercialItem()
                    link = site.xpath('@href').extract()
                    #print link, '\n\n'
                    item['link'] = link
                    link = 'http://www.realcommercial.com.au/' + str(link[0])
                    #print 'link!!!!!!=', link
                    new_request = Request(link, callback=self.parse_file_page)
                    new_request.meta['item'] = item
                    yield new_request
                    #items.append(item)
                yield item
                return

            def process_links(self, links):
                print 'inside process links'
                for i, w in enumerate(links):
                    print w.url,'\n\n\n'
                    w.url = "http://www.realcommercial.com.au/" + w.url
                    print w.url,'\n\n\n'
                    links[i] = w

                return links

            def parse_file_page(self, response):
                #item passed from request
                #print 'parse_file_page!!!'
                item = response.meta['item']
                #selector
                sel = Selector(response)
                title = sel.xpath('//*[@id="listing_address"]').extract()
                #print title
                item['title'] = title

                return item

Log

                2015-11-29 15:42:55 [scrapy] INFO: Scrapy 1.0.3 started (bot: realcommercial)
                2015-11-29 15:42:55 [scrapy] INFO: Optional features available: ssl, http11, boto
                2015-11-29 15:42:55 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'realcommercial.spiders', 'FEED_FORMAT': 'csv', 'SPIDER_MODULES': ['realcommercial.spiders'], 'FEED_URI': 'aaa.csv', 'BOT_NAME': 'realcommercial'}
                2015-11-29 15:42:56 [scrapy] INFO: Enabled extensions: CloseSpider, FeedExporter, TelnetConsole, LogStats, CoreStats, SpiderState
                2015-11-29 15:42:57 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
                2015-11-29 15:42:57 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
                2015-11-29 15:42:57 [scrapy] INFO: Enabled item pipelines:
                2015-11-29 15:42:57 [scrapy] INFO: Spider opened
                2015-11-29 15:42:57 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
                2015-11-29 15:42:57 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
                2015-11-29 15:42:59 [scrapy] DEBUG: Crawled (200) <GET http://www.realcommercial.com.au/for-sale/in-vic/list-1?nearbySuburb=false&autoSuggest=false&activeSort=list-date> (referer: None)
                2015-11-29 15:42:59 [scrapy] INFO: Closing spider (finished)
                2015-11-29 15:42:59 [scrapy] INFO: Dumping Scrapy stats:
                {'downloader/request_bytes': 303,
                 'downloader/request_count': 1,
                 'downloader/request_method_count/GET': 1,
                 'downloader/response_bytes': 30599,
                 'downloader/response_count': 1,
                 'downloader/response_status_count/200': 1,
                 'finish_reason': 'finished',
                 'finish_time': datetime.datetime(2015, 11, 29, 10, 12, 59, 418000),
                 'log_count/DEBUG': 2,
                 'log_count/INFO': 7,
                 'response_received_count': 1,
                 'scheduler/dequeued': 1,
                 'scheduler/dequeued/memory': 1,
                 'scheduler/enqueued': 1,
                 'scheduler/enqueued/memory': 1,
                 'start_time': datetime.datetime(2015, 11, 29, 10, 12, 57, 780000)}
                2015-11-29 15:42:59 [scrapy] INFO: Spider closed (finished)

1 Answer:

Answer 0 (score: 0)

I found the answer myself. There were two problems:

  1. process_links was prepending "http://www.realcommercial.com.au/" even though it was already there; I had assumed the link extractor would return relative URLs, but it returns absolute ones.
  2. The regular expression in the link extractor was incorrect.

I changed both of these and the spider worked; a sketch of the corrected pieces is shown below.
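
For reference, here is a minimal sketch of how the corrected spider might look. It assumes the pagination links contain the literal query string ?activeSort=list-date (so the ? must be escaped in the regex) and simply drops process_links, since the extracted URLs are already absolute. The names and XPath expressions mirror the original spider; this is not the poster's actual final code.

        from scrapy.spiders import CrawlSpider, Rule
        from scrapy.linkextractors import LinkExtractor
        from scrapy.http import Request


        class RealCommercial(CrawlSpider):
            name = "realcommercial"
            allowed_domains = ["realcommercial.com.au"]
            start_urls = [
                "http://www.realcommercial.com.au/for-sale/in-vic/list-1?nearbySuburb=false&autoSuggest=false&activeSort=list-date"
            ]

            rules = [
                # Escaping the '?' is the key fix: unescaped, '\d+?' is parsed as a
                # non-greedy quantifier, so the pattern expects 'activeSort' to follow
                # the digits directly and never matches the real pagination URLs.
                Rule(LinkExtractor(allow=[r'/for-sale/in-vic/list-\d+\?activeSort=list-date']),
                     callback='parse_response',
                     follow=True),
            ]

            def parse_response(self, response):
                # LinkExtractor hands the rules absolute URLs, and response.urljoin()
                # resolves the relative hrefs found on this page, so no process_links
                # hook is needed to prepend the domain.
                for href in response.xpath("//a[@class='details']/@href").extract():
                    yield Request(response.urljoin(href), callback=self.parse_file_page)

            def parse_file_page(self, response):
                # Scrapy 1.0 accepts plain dicts as items; the original
                # RealcommercialItem could be used here instead.
                yield {
                    'link': response.url,
                    'title': response.xpath('//*[@id="listing_address"]').extract(),
                }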