Scrapy: parse items from page 1, then follow links to get the other items

Date: 2016-02-03 01:03:26

Tags: python callback scrapy scrapy-spider

Update: I was able to get this moving, but it does not go back into the sub-page and iterate through the sequence again. The data I want to extract looks like the table below:

date_1 | source_1 | link to article_1
date_2 | source_2 | link to article_2
...

I need to collect date_1 and source_1 first, then follow the link to that article, and repeat...

Any help is greatly appreciated. :)

from scrapy.spiders import BaseSpider, Rule
from scrapy.selector import HtmlXPathSelector
from scrapy.contrib.linkextractors import LinkExtractor
from dirbot.items import WebsiteLoader
from scrapy.http import Request
from scrapy.http import HtmlResponse



class DindexSpider(BaseSpider):
    name = "dindex"
    allowed_domains = ["newslookup.com"]
    start_urls = [
        "http://www.newslookup.com/Business/"
    ]

    def parse_subpage(self, response):
        # Callback for each follow-up request: pull the publish time off the
        # sub-page and finish loading the item started in parse().
        self.log("Scraping: " + response.url)
        il = response.meta['il']
        time = response.xpath('//div[@id="update_data"]//td[@class="stime3"]//text()').extract()
        il.add_value('publish_date', time)
        yield il.load_item()

    def parse(self, response):
        # Parse the Business section listing: build one item loader per
        # article cell, then hand it to parse_subpage via request meta.
        self.log("Scraping: " + response.url)
        hxs = HtmlXPathSelector(response)
        sites = hxs.select('//td[@class="article"]')

        for site in sites:
            il = WebsiteLoader(response=response, selector=site)
            il.add_xpath('name', 'a/text()')
            il.add_xpath('url', 'a/@href')
            # NOTE: this re-requests the section page instead of the article
            # href extracted above.
            yield Request("http://www.newslookup.com/Business/", meta={'il': il}, callback=self.parse_subpage)

1 Answer:

Answer 0 (score: 0)

That's just because you need to use the CrawlSpider class instead of BaseSpider:

from scrapy.spiders import CrawlSpider

class DindexSpider(CrawlSpider):
    # ...
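
For reference, a minimal sketch of what that CrawlSpider version could look like, assuming the same WebsiteLoader, XPaths, and URLs from the question. The Rule pattern, the parse_listing/parse_article names, and the urljoin-based article request are illustrative assumptions, not part of the original answer. Note that a CrawlSpider must not override parse, so the listing callback gets its own name.

# A minimal sketch, assuming the question's WebsiteLoader and XPaths;
# the Rule regex and callback names below are assumptions for illustration.
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from scrapy.http import Request
from dirbot.items import WebsiteLoader


class DindexSpider(CrawlSpider):
    name = "dindex"
    allowed_domains = ["newslookup.com"]
    start_urls = ["http://www.newslookup.com/Business/"]

    # Follow section/pagination links and run parse_listing on each page.
    rules = (
        Rule(LinkExtractor(allow=r"/Business/"), callback="parse_listing", follow=True),
    )

    def parse_start_url(self, response):
        # Run the start URL through the same listing parser as the rule hits.
        return self.parse_listing(response)

    def parse_listing(self, response):
        # One loader per article cell; the article link itself is requested
        # so the publish date can be added before the item is yielded.
        for site in response.xpath('//td[@class="article"]'):
            il = WebsiteLoader(response=response, selector=site)
            il.add_xpath('name', 'a/text()')
            il.add_xpath('url', 'a/@href')
            article_url = site.xpath('a/@href').extract_first()
            if article_url:
                yield Request(response.urljoin(article_url),
                              meta={'il': il}, callback=self.parse_article)

    def parse_article(self, response):
        # Finish the item with the publish date found on the article page.
        il = response.meta['il']
        time = response.xpath('//div[@id="update_data"]//td[@class="stime3"]//text()').extract()
        il.add_value('publish_date', time)
        yield il.load_item()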