scrapy SgmlLinkExtractor scraping Master and Detail pages

Asked: 2015-11-26 11:40:37

Tags: python web-scraping scrapy scrapy-spider

I am trying to extract information from both listing and detail pages. The code below correctly scrapes reviewer information from the listing page and from all linked pages (those containing "Next"), and it also captures the detail_pages URLs, e.g. http://www.screwfix.com/p/prysmian-6242y-twin-earth-cable-2-5mm-x-100m-grey/20967

However, I cannot see how to navigate to the detail pages and scrape information from them.

Has anyone here used Scrapy successfully and could help me finish this spider?

Thanks for your help.

I include the spider's code below:

# -*- coding: utf-8 -*-
import scrapy
from scrapy.http import Request
from scrapy.spider import Spider
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import Selector

from hn_scraper.items import HnArticleItem


class ScrewfixSpider(Spider):
    name = "Screwfix"
    allowed_domains = ["www.screwfix.com"]
    start_urls = ('http://www.screwfix.com/', )

    link_extractor = SgmlLinkExtractor(
        allow=('www', ),
        restrict_xpaths=('//a[contains(., "Next")]', ))

    detail_page_extractor = SgmlLinkExtractor(
        allow=('www', ), 
        restrict_xpaths=('//tr[@id[contains(., "reviewer")]]/td[3]/a', ))

    def extract_one(self, selector, xpath, default=None):
        extracted = selector.xpath(xpath).extract()
        if extracted:
            return extracted[0]
        return default

    def parse(self, response):
        for link in self.link_extractor.extract_links(response):
            request = Request(url=link.url)
            request.meta.update(link_text=link.text)
            yield request

        for item in self.parse_item(response):
            yield item


    def parse_item(self, response):
        selector = Selector(response)

        rows = selector.xpath('//table[contains(.,"crDataGrid")]//tr[@id[contains(., "reviewer")]]')
        for row in rows:
            item = HnArticleItem()

            reviewer = row.xpath('td[3]/a')                     
            reviewer_url = self.extract_one(reviewer, './@href', '')
            reviewer_name = self.extract_one(reviewer, 'b/text()', '')
            total_reviews = row.xpath('td[4]/text()').extract()

            item['url'] = reviewer_url
            item['name'] = reviewer_name
            item['total_reviews'] = total_reviews       

            yield item


        detail_pages = self.detail_page_extractor.extract_links(response)

        if detail_pages:

            print 'detail_pages'
            print detail_pages[0].url

            yield Request(detail_pages[0].url)
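For reference, the usual Scrapy pattern here is to give each detail-page Request an explicit `callback`; without one, the response is routed back to the default `parse` method, which never reaches the detail-page XPaths. The sketch below illustrates only that dispatch pattern with stdlib-only stub classes standing in for Scrapy's `Request`/`Response` (the `parse_detail` name and the field it fills are hypothetical, not part of the original spider):

```python
# Minimal sketch of the listing -> detail callback pattern. The Request
# stub mirrors scrapy.http.Request's (url, callback) signature; in the
# real spider you would use scrapy's own Request class instead.

class Request(object):
    def __init__(self, url, callback=None):
        self.url = url
        self.callback = callback


def parse_item(detail_urls):
    # In the spider this would sit after the reviewer loop: each detail
    # link becomes a Request whose callback parses that detail page,
    # instead of falling back to the default parse() method.
    for url in detail_urls:
        yield Request(url, callback=parse_detail)


def parse_detail(response):
    # Hypothetical detail-page parser: in Scrapy this would xpath the
    # detail page and yield an item; here it just echoes the URL.
    return {'url': response}


requests = list(parse_item(['http://www.screwfix.com/p/example/20967']))
for req in requests:
    print(req.callback(req.url))
```

The key point is the `callback=` argument: each extracted detail link gets its own Request, and Scrapy later invokes that callback with the downloaded detail page as the response.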

0 Answers:

No answers yet.