Scraping statistics from the Supreme Community website

Time: 2019-02-27 20:24:27

Tags: python scrapy

I am trying to scrape the title, price, and upvote/downvote counts for each item on the Supreme Community website, using the Scrapy library in Python.

import scrapy


class SupremeSpider(scrapy.Spider):
    name = "Supreme"
    start_urls = [
        'https://www.supremecommunity.com/season/spring-summer2019/droplist/2019-02-25/'
    ]

    def parse(self, response):
        for data in response.css('div.card-details'):
            yield {
                'title': data.xpath("//h2/text()").getall(),
                'price': data.css('span.label-price::text').get()     
                #'upvotes': data.xpath("//p/text()").getall()
                #'downvotes': quote.css('div.tags a.tag::text').getall(),
            }

When I run scrapy crawl Supreme in CMD,

I get the following result:

  

2019-02-27 14:19:09 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.supremecommunity.com/season/spring-summer2019/droplist/2019-02-25/>
{'title': ['Airbrushed Floral Skateboard', 'Formula Crewneck',
'Supreme®/MasterLock® Numeric Combination Lock', 'Supreme®/SIGG™ CYD 1.0L Water Bottle',
'Waist Bag', 'Creeper Tee', 'Smash Tee', 'FREE GIFT Shower Cap',
'Christopher Walken King Of New York Tee', 'Dish Towels (Set of 3)',
'Metal Lighter Holster', 'Bonded Logo Puffy Jacket', 'Shoulder Bag',
'Chenille Hooded Sweatshirt', 'Backpack', 'Overdyed Beanie', 'Fruit Tee',
'Knot Tee', 'Organizer Pouch', 'Supreme®/Hanes® Leopard Boxer Briefs (2 Pack)',
'Duffle Bag', 'Real Shit L/S Tee', 'Red Rum Baseball Jersey',
'Supreme®/Hanes® Boxer Briefs (4 Pack)', 'Kids Tee', 'Toy Uzi Inflatable Pillow',
'Apple Hooded Sweatshirt', 'Spotlight Keychain', 'Supreme®/Hanes® Crew Socks (4 Pack)',
'Taped Seam Jacket', 'Front Tee', 'Fruit Skateboard', 'Hard Goods Tee',
'Leda And The Swan Tee', 'Military Camp Cap', 'Leather Varsity Jacket',
'Patchwork Harrington Jacket', 'Formula Sweatpant',
'Supreme®/Hanes® Tagless Tees (3 Pack)', 'I Make Shit Happen Pin',
'Leda And The Swan Skateboard', 'Original Sin Tee', 'Clouds L/S Top',
'Racing Logo Work Shirt', 'Silk Camo Shirt', 'Statue of Liberty Pendant',
'Lust Ceramic Box', 'Piping Jacket', 'Patchwork Mohair Cardigan',
'Supreme®/Hanes® Leopard Tagless Tees (2 Pack)', 'Logo Hooded Pullover Sweatshirt',
'Supreme®/Spitfire® Classic Wheels (Set of 4)', 'Middle Finger To The World Tee',
'S/S Pocket Tee', 'Supreme®/Independent® Truck', 'GORE-TEX S-Logo 6-Panel',
'Tag Logo Sweater', 'Tech L/S Tee', 'Shears Hooded Sweatshirt',
'Patchwork Cargo Pant', 'Stone Washed Slim Jean', 'Text Stripe New Era®',
'Fuzzy Pile Trucker Jacket', 'D-Ring Trench Coat', 'Multi Stripe S/S Top',
'Piping Pant', 'Work Pant', 'Tag Logo Beanie', 'Corduroy Compact Logo 6-Panel',
'Oxford Shirt', 'Set In Logo Sweatpant', 'Washed Black Slim Jean',
'Rose Buffalo Plaid Shirt', 'Patchwork Bell Hat', 'Paisley Stripe L/S Top',
'Fuzzy Pile Short', 'Tie Dye Ripstop Camp Cap', 'Taped Seam Pant',
'Washed Regular Jean', 'Rigid Slim Jean', 'World 5-Panel',
'Signature Script Logo Camp Cap', 'Motherfucker 6-Panel'],
'price': '\n $48/£46\n '}

I am trying to get one item per product, formatted like this:

{title: Airbrushed Floral Skateboard, price: $48/£46, upvotes: 14218, downvotes: 1034}

1 Answer:

Answer 0: (score: 1)

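When using nested selectors, you need to use a proper relative XPath (one that starts with a dot); otherwise the expression is evaluated against the entire response rather than the current element: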

'title': data.xpath(".//h2/text()").get(),

See the documentation: https://docs.scrapy.org/en/latest/topics/selectors.html#working-with-relative-xpaths