Question

I am trying to scrape the title, price, and upvote/downvote counts from the Supreme Community website using the Scrapy library in Python.

import scrapy


class SupremeSpider(scrapy.Spider):
    name = "Supreme"
    start_urls = [
        'https://www.supremecommunity.com/season/spring-summer2019/droplist/2019-02-25/'
    ]

    def parse(self, response):
        for data in response.css('div.card-details'):
            yield {
                'title': data.xpath("//h2/text()").getall(),
                'price': data.css('span.label-price::text').get()
                #'upvotes': data.xpath("//p/text()").getall()
                #'downvotes': quote.css('div.tags a.tag::text').getall(),
            }

When I run scrapy crawl Supreme in CMD, the result is as follows:

2019-02-27 14:19:09 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.supremecommunity.com/season/spring-summer2019/droplist/2019-02-25/>
{'title': ['Airbrushed Floral Skateboard', 'Formula Crewneck', ..., 'Motherfucker 6-Panel'], 'price': '\n    $48 / £46\n    '}

(The title list is shortened here; it contains every product title on the page, not just the one for the current card.)

I am trying to get the output to look like this:

{title: Airbrushed Floral Skateboard, price: $48 / £46, upvotes: 14218, downvotes: 1034}

Answer 1

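When using nested selectors, you need to use a proper relative XPath (one that starts with a dot); otherwise it will extract from the entire response rather than from the current element: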

'title': data.xpath(".//h2/text()").get(),

See the documentation: https://docs.scrapy.org/en/latest/topics/selectors.html#working-with-relative-xpaths

Collecting statistics from the Supreme Community website