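Back to Basics: Scrapy

Time: 2015-11-13 18:35:29

Tags: python scrapy

I'm new to Scrapy and could really use some pointers. I've worked through a few examples, but I'm still missing some of the basics. I'm running Scrapy 1.0.3.

Spider: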

from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from matrix_scrape.items import MatrixScrapeItem


class MySpider(BaseSpider):
    name = "matrix"
    allowed_domains = ["https://www.kickstarter.com/projects/2061039712/matrix-the-internet-of-things-for-everyonetm"]
    start_urls = ["https://www.kickstarter.com/projects/2061039712/matrix-the-internet-of-things-for-everyonetm"]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)

        item = MatrixScrapeItem()
        item['backers'] = hxs.select("//*[@id="backers_count"]/data").extract()
        item['totalPledged'] = hxs.select("//*[@id="pledged"]/data").extract()
        print backers, totalPledged

Items:

import scrapy


class MatrixScrapeItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()

    backers = scrapy.Field()
    totalPledged = scrapy.Field()

    pass

I get this error:

File "/home/will/Desktop/repos/scrapy/matrix_scrape/matrix_scrape/spiders/test.py", line 15
    item['backers'] = hxs.select("//*[@id="backers_count"]/data").extract()

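The question is: why aren't select and extract working properly? I have seen people use just Selector instead of HtmlXPathSelector.

Also, I'm trying to save this to a CSV file and run it on a schedule (pulling these data points every 30 minutes). If anyone has examples of that, they'd earn super brownie points :)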

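1 Answer:

Answer 0: (score: 2)

The syntax error comes from the way the double quotes are used: the double quotes inside the XPath expression end the Python string literal early. Mix single and double quotes instead: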

item['backers'] = hxs.select('//*[@id="backers_count"]/data').extract()
item['totalPledged'] = hxs.select('//*[@id="pledged"]/data').extract()
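To make the quoting issue concrete, here is a small sketch (plain Python, no Scrapy needed) showing a few equivalent ways to write the same XPath string, and confirming that the original form does not even compile. The variable names are illustrative only and do not come from the question.

```python
# Three equivalent ways to write an XPath that contains double quotes.
single_quoted = '//*[@id="backers_count"]/data'
escaped = "//*[@id=\"backers_count\"]/data"
triple_quoted = """//*[@id="backers_count"]/data"""

# The form used in the question: Python ends the string at the second
# double quote, so the statement is rejected before it ever runs.
broken_source = 'xpath = "//*[@id="backers_count"]/data"'
try:
    compile(broken_source, "<example>", "exec")
    compiled = True
except SyntaxError:
    compiled = False
```

Any of the three working forms is fine; wrapping the expression in single quotes is usually the easiest to read.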

As a side note, you can use the response.xpath() shortcut instead of instantiating HtmlXPathSelector:

def parse(self, response):
    item = MatrixScrapeItem()
    item['backers'] = response.xpath('//*[@id="backers_count"]/data').extract()
    item['totalPledged'] = response.xpath('//*[@id="pledged"]/data').extract()
    # backers and totalPledged are item fields, not local variables
    print item['backers'], item['totalPledged']
    return item

You probably also want the text() of the data elements rather than the elements themselves:

//*[@id="backers_count"]/data/text()
//*[@id="pledged"]/data/text()
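
The difference between selecting the data element and selecting its text can be sketched with the standard library's xml.etree.ElementTree, used here only as a stand-in for Scrapy's selectors. Note the assumptions: the sample markup and its values are hypothetical (not the real Kickstarter page), and ElementTree's limited XPath has no text() step, so the .text attribute plays that role.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup and values; the real page structure is an assumption.
SAMPLE = """
<div>
  <div id="backers_count"><data>123</data></div>
  <div id="pledged"><data>4567</data></div>
</div>
"""

root = ET.fromstring(SAMPLE)

# Selecting the element itself yields the whole node, tags included.
backers_node = root.find('.//*[@id="backers_count"]/data')
backers_markup = ET.tostring(backers_node, encoding="unicode").strip()

# Reading .text yields only the text content, which is what
# .../data/text() would select in a full XPath engine.
backers_text = backers_node.text
pledged_text = root.find('.//*[@id="pledged"]/data').text
```

In Scrapy, the same distinction shows up as extract() returning markup such as `<data>...</data>` for the element path, and only the inner text once /text() is appended.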