Question

I am trying to scrape lynda.com courses and store their information in a CSV file. This is my code:

# -*- coding: utf-8 -*-
import scrapy
import itertools


class LyndadevSpider(scrapy.Spider):
    name = 'lyndadev'
    allowed_domains = ['lynda.com']
    start_urls = ['https://www.lynda.com/Developer-training-tutorials']

    def parse(self, response):
        #print(response.url)
        titles = response.xpath('//li[@role="presentation"]//h3/text()').extract()
        descs = response.xpath('//li[@role="presentation"]//div[@class="meta-description hidden-xs dot-ellipsis dot-resize-update"]/text()').extract()
        links = response.xpath('//li[@role="presentation"]/div/div/div[@class="col-xs-8 col-sm-9 card-meta-data"]/a/@href').extract()

        for title, desc, link in itertools.izip(titles, descs, links):
            #print link
            categ = scrapy.Request(link, callback=self.parse2)
            yield {'desc': link, 'category': categ}

    def parse2(self, response):
        #getting categories by storing the navigation info
        item = response.xpath('//ol[@role="navigation"]').extract()
        return item

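What I'm trying to do here is scrape the title and description of each tutorial in the list, then navigate to its URL and scrape the category in parse2.

However, I get results like the following: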

category,desc
<GET https://www.lynda.com/SVN-Subversion-tutorials/SVN-Java-Developers/552873-2.html>,https://www.lynda.com/SVN-Subversion-tutorials/SVN-Java-Developers/552873-2.html
<GET https://www.lynda.com/Java-tutorials/WebSocket-Programming-Java-EE/574694-2.html>,https://www.lynda.com/Java-tutorials/WebSocket-Programming-Java-EE/574694-2.html
<GET https://www.lynda.com/GameMaker-tutorials/Building-Physics-Based-Platformer-GameMaker-Studio-Using-GML/598780-2.html>,https://www.lynda.com/GameMaker-tutorials/Building-Physics-Based-Platformer-GameMaker-Studio-Using-GML/598780-2.html

How do I access the information I want?

Answer 1

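In your code the scrapy.Request is only stored as a value in the dict you yield. It is never handed to the scheduler, so parse2 is never called and the CSV exporter writes the request's string form (<GET ...>). Instead of yielding a dict, you need to yield the scrapy.Request itself from the parse method (the one that handles the responses of start_urls) and return the finished item from the callback. Also, I would rather loop over the course items and extract the information for each course item separately.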
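To see where the <GET ...> entries in the question's CSV come from, here is a minimal sketch that uses only the standard library. FakeRequest is a hypothetical stand-in that mimics nothing but the string form of scrapy.Request, and export_rows is a made-up helper. Scrapy's real exporter is assumed to stringify unknown objects in a similar way.

```python
import csv
import io


class FakeRequest:
    """Hypothetical stand-in for scrapy.Request; it only mimics the string form."""

    def __init__(self, url, method="GET"):
        self.url = url
        self.method = method

    def __str__(self):
        return "<%s %s>" % (self.method, self.url)


def export_rows(rows):
    """Write dict rows to CSV text; the csv module calls str() on non-string values."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["category", "desc"], lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


url = "https://www.lynda.com/SVN-Subversion-tutorials/SVN-Java-Developers/552873-2.html"
# The request object sits in the row as plain data; nothing ever downloads it.
csv_text = export_rows([{"category": FakeRequest(url), "desc": url}])
print(csv_text)
```

The request object ends up in the file as text because it was treated as a value, not scheduled for download. Yielding the request on its own is what makes Scrapy fetch the page and call the callback.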

I'm not sure what you mean exactly by categories. I suppose those are the tags you can see on the course details page at the bottom under Skills covered in this course. But I might be wrong.

Try this code:

# -*- coding: utf-8 -*-
import scrapy

class LyndaSpider(scrapy.Spider):
    name = "lynda"
    allowed_domains = ["lynda.com"]
    start_urls = ['https://www.lynda.com/Developer-training-tutorials']

    def parse(self, response):
        courses = response.css('ul#category-courses div.card-meta-data')
        for course in courses:
            item = {
                'title': course.css('h3::text').extract_first(),
                'desc': course.css('div.meta-description::text').extract_first(),
                'link': course.css('a::attr(href)').extract_first(),
            }
            request = scrapy.Request(item['link'], callback=self.parse_course)
            request.meta['item'] = item
            yield request

    def parse_course(self, response):
        item = response.meta['item']
        # Alternative: collect the tags from the course page instead of the breadcrumb
        #item['categories'] = response.css('div.tags a em::text').extract()
        item['category'] = response.css('ol.breadcrumb li:last-child a span::text').extract_first()
        return item
