Python Scrapy not writing output to a CSV file

Asked: 2018-09-12 21:42:08

Tags: python scrapy

What is this script doing wrong, such that it doesn't output a CSV file with the data? I am running it with scrapy runspider yellowpages.py -o items.csv, but nothing comes out except a blank CSV file. I have followed various guides and watched YouTube videos trying to figure out where I made a mistake, but I still can't work out what I am doing wrong.

# -*- coding: utf-8 -*-
import scrapy
import requests

search = "Plumbers"
location = "Hammond, LA"
url = "https://www.yellowpages.com/search"
q = {'search_terms': search, 'geo_location_terms': location}
page = requests.get(url, params=q)
page = page.url
items = ()


class YellowpagesSpider(scrapy.Spider):
    name = 'quotes'
    allowed_domains = ['yellowpages.com']
    start_urls = [page]

    def parse(self, response):
        self.log("I just visited: " + response.url)
        items = response.css('a[class=business-name]::attr(href)')
        for item in items:
            print(item)

3 Answers:

Answer 0 (score: 3)

A simple spider, without using Items.

Here is my code; I added comments to make it easier to understand. The spider looks, across all result pages, for every block matching the pair of parameters "servise" and "location". To run it, use:

In your case:

scrapy runspider yellowpages.py -a servise="Plumbers" -a location="Hammond, LA" -o Hammondsplumbers.csv

The code will also work for any other query. For example:

scrapy runspider yellowpages.py -a servise="Doctors" -a location="California, MD" -o MDDoctors.json

and so on...

# -*- coding: utf-8 -*-
import scrapy
from scrapy.http import Request
from scrapy.exceptions import CloseSpider


class YellowpagesSpider(scrapy.Spider):
    name = 'yellowpages'
    allowed_domains = ['yellowpages.com']
    start_urls = ['https://www.yellowpages.com/']

    # We can use any servise + location pair in our request
    def __init__(self, servise=None, location=None, *args, **kwargs):
        super(YellowpagesSpider, self).__init__(*args, **kwargs)
        self.servise = servise
        self.location = location

    def parse(self, response):
        # If "service " and" location " are defined 
        if self.servise and self.location:
            # Create search phrase using "service" and " location"
            search_url = 'search?search_terms={}&geo_location_terms={}'.format(self.servise, self.location)
            # Send request with url "yellowpages.com" + "search_url", then call parse_result
            yield Request(url=response.urljoin(search_url), callback=self.parse_result)
        else:
            # Else close our spider
            # You can add a default value if you want.
            self.logger.warning('=== Please use keys -a servise="service_name" -a location="location" ===')
            raise CloseSpider()

    def parse_result(self, response):
        # All result blocks, excluding the ad posts
        posts = response.xpath('//div[@class="search-results organic"]//div[@class="v-card"]')
        for post in posts:
            yield {
                'title': post.xpath('.//span[@itemprop="name"]/text()').extract_first(),
                'url': response.urljoin(post.xpath('.//a[@class="business-name"]/@href').extract_first()),
            }

        next_page = response.xpath('//a[@class="next ajax-page"]/@href').extract_first()
        # If we have next page url
        if next_page:
            # Send request with url "yellowpages.com" + "next_page", then call parse_result
            yield scrapy.Request(url=response.urljoin(next_page), callback=self.parse_result)

Answer 1 (score: 0)

While inspecting your code, I noticed a number of issues:

First, you initialize items as a tuple when it should be a list: items = [].

You should change the name attribute to reflect the name you want for your crawler, so that you can use it like this: scrapy crawl my_crawler, where name = "my_crawler".

start_urls should contain strings, not Request objects. You should change the entry from page to the exact search string you want to use. If you have a number of search strings and want to iterate over them, I would suggest using middleware.
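For example, a minimal sketch of building that search string up front (the urlencode call is my own illustration; the search terms are just the ones from the question):

from urllib.parse import urlencode

search = "Plumbers"
location = "Hammond, LA"
# Build a fully encoded search URL so that start_urls holds a plain string
query = urlencode({'search_terms': search, 'geo_location_terms': location})
start_urls = ["https://www.yellowpages.com/search?" + query]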

When you try to extract the data from the CSS selector, you forget to call extract() (getall() on newer Scrapy versions), which is what actually turns the selectors into string data you can work with.
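As a sketch, the extraction step inside parse() could look something like this (using the selector from the question; extract() is available on any Scrapy 1.x release):

    def parse(self, response):
        # extract() turns the SelectorList into a list of plain strings
        links = response.css('a[class=business-name]::attr(href)').extract()
        for href in links:
            self.log(response.urljoin(href))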

Also, you should not dump your output to the standard output stream: a lot of logging goes there, which will make your output file a real mess. Instead, extract the responses into items, for example with item loaders.
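A hedged sketch of the item loader approach (the BusinessItem fields and the selectors are illustrative assumptions, not taken from the question):

import scrapy
from scrapy.loader import ItemLoader


class BusinessItem(scrapy.Item):
    # Illustrative fields; rename to match what you actually scrape
    name = scrapy.Field()
    url = scrapy.Field()


# Inside your spider class:
def parse(self, response):
    for post in response.css('div.v-card'):
        loader = ItemLoader(item=BusinessItem(), selector=post)
        loader.add_css('name', 'a.business-name::text')
        loader.add_css('url', 'a.business-name::attr(href)')
        yield loader.load_item()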

Finally, you are probably missing the appropriate settings in your settings.py file. You can find the relevant documentation here.

FEED_FORMAT = "csv"
FEED_EXPORT_FIELDS = ["Field 1", "Field 2", "Field 3"]
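
(As a side note: on Scrapy 2.1 and later these options were consolidated into the single FEEDS setting; if I remember the newer API correctly, the equivalent would be something like:)

FEEDS = {
    "items.csv": {"format": "csv"},
}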

Answer 2 (score: 0)

for item in items:
    print(item)

Here you are printing instead of yielding. Yield the items instead:

for item in items:
    yield item
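
Putting the fixes together, a minimal corrected version of the original spider might look like this (a sketch that keeps the question's URL-building approach; the 'url' field name is my own choice):

# -*- coding: utf-8 -*-
import scrapy
import requests

search = "Plumbers"
location = "Hammond, LA"
url = "https://www.yellowpages.com/search"
q = {'search_terms': search, 'geo_location_terms': location}
page = requests.get(url, params=q).url


class YellowpagesSpider(scrapy.Spider):
    name = 'quotes'
    allowed_domains = ['yellowpages.com']
    start_urls = [page]

    def parse(self, response):
        self.log("I just visited: " + response.url)
        for item in response.css('a[class=business-name]::attr(href)'):
            # Yield dicts so the CSV exporter has rows to write
            yield {'url': response.urljoin(item.extract())}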