Scrapy: comparing the output of two spiders

Date: 2017-01-19 10:54:59

Tags: python html css web-scraping scrapy

I am learning to use Scrapy, and as an exercise I am writing spiders that scrape different websites; in this example, https://www.thuisbezorgd.nl/eten-bestellen-castricum and https://www.iens.nl/restaurant+zoetermeer. There is something I do not understand; let us compare the two spiders:

Spider 1:

import scrapy
from datetime import datetime
from scrapy import Request
import urllib.parse as urlparse
from scrapy.loader import ItemLoader
from iensScraper.items import IensscraperItem
from scrapy.crawler import CrawlerProcess

class IensSpider(scrapy.Spider):
    name = "ienzz"
    start_urls = ['https://www.iens.nl/restaurant+zoetermeer']

    allowed_domains = ['iens.nl']

    def parse(self, response):
        # Collect (name, address, rating, review count) for every
        # restaurant on the current result page.
        restaurants = response.css('.resultItem')
        items = [(restaurant.css('[href]::text').extract_first(),
                  restaurant.css('.resultItem-address::text').extract_first(),
                  restaurant.css('.rating-ratingValue::text').extract_first(),
                  restaurant.css('.reviewsCount>[href]::text').extract_first())
                 for restaurant in restaurants]
        for item in items:
            holder = ItemLoader(item=IensscraperItem(), response=response)
            holder.add_value('naam', item[0])
            holder.add_value('adres', item[1])
            holder.add_value('rating', item[2])
            holder.add_value('recensies', item[3])
            yield holder.load_item()

        # Only after the whole page has been parsed is the request for
        # the next page yielded.
        if response.xpath('//*[@class="next"]//@href').extract():
            link = response.css('.next>a::attr(href)').extract()
            yield Request(urlparse.urljoin(response.url, link[0]),
                          callback=self.parse, dont_filter=True)

process = CrawlerProcess()
process.crawl(IensSpider)
process.start()

Output of spider 1:

2017-01-19 11:48:20 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.iens.nl/restaurant+zoetermeer?page=2>
{'adres': ['\n'
           '                            Middelwaard 86 2716 CW Zoetermeer\n'
           '                                                    '],
 'naam': ['Meerzicht']}
2017-01-19 11:48:20 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.iens.nl/restaurant+zoetermeer?page=2>
{'adres': ['\n'
           '                            Burgemeester van Leeuwenpassage 2 2711 '
           'JV Zoetermeer\n'
           '                                                    '],
 'naam': ['Brandcafé Zoetermeer']}
2017-01-19 11:48:20 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.iens.nl/restaurant+zoetermeer?page=2>
{'adres': ['\n'
           '                            Van der Hagenstraat 22 2722 NT '
           'Zoetermeer\n'
           '                                                    '],
 'naam': ['Taste of Asia']}

Spider 2:

import scrapy
import urllib.parse as urlparse
from scrapy import Request
from scrapy.loader import ItemLoader
from scrapy.crawler import CrawlerProcess
from thuisbezorgdscraper.items import ThuisbezorgdscraperItem


class ThuisSpider(scrapy.Spider):
    name = 'spiderman'
    allowed_domains = ['thuisbezorgd.nl']
    start_urls = ['https://www.thuisbezorgd.nl/eten-bestellen-castricum']

    def parse(self, response):
        # Extract all delivery-area URLs from the start page and request
        # them; all of these requests are yielded in one go.
        raw_urls = response.css('.delarea')
        urls = raw_urls.css('::attr(href)').extract()
        for url in urls:
            yield Request(urlparse.urljoin(response.url, url),
                          callback=self.parse_item, dont_filter=True)

    def parse_item(self, response):
        # Scrape every restaurant listed on a delivery-area page.
        restaurants = response.css('.restaurant.grid')
        for restaurant in restaurants:
            loader = ItemLoader(item=ThuisbezorgdscraperItem(), response=response)
            name = restaurant.css('.restaurantname[itemprop]::text').extract()
            address = restaurant.css('.restaurantaddress::text').extract()
            score = restaurant.css('.pointy::attr(title)').extract()
            reviews = restaurant.css('.nrofreviews::text').extract()
            loader.add_value('address', address)
            loader.add_value('name', name)
            loader.add_value('score', score)
            loader.add_value('reviews', reviews)
            yield loader.load_item()

process = CrawlerProcess()
process.crawl(ThuisSpider)
process.start()

Output of spider 2:

2017-01-19 11:12:06 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.thuisbezorgd.nl/eten-bestellen-castricum-castricum-zuid-1901>
{'address': ['\n\t\t      Rijksweg 2', '\t      '],
 'name': ['New York Pizza'],
 'reviews': ['200 recensies'],
 'score': ['Klantbeoordeling: 7 / 10']}
2017-01-19 11:12:06 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.thuisbezorgd.nl/eten-bestellen-castricum-castricum-noord-1902>
{'address': ['\n\t\t      Heemskerkerweg 93', '\t      '],
 'name': ['Fresco'],
 'reviews': ['1420 recensies'],
 'score': ['Klantbeoordeling: 8 / 10']}
2017-01-19 11:12:06 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.thuisbezorgd.nl/eten-bestellen-castricum-centrum-1901>
{'address': ['\n\t\t      Boulevard 13', '\t      '],
 'name': ['iSAFIA'],
 'reviews': ['285 recensies'],
 'score': ['Klantbeoordeling: 8 / 10']}

Spider 1: the first spider scrapes the 'iens' site exactly as I want it to: after parsing all the information on the start URL, it uses the pagination index to move on to the next page, parses that page, and so on. The output confirms this behaviour: all restaurants on the first page are returned first, then all restaurants on the second page, until no pages are left.
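
That one-page-at-a-time behaviour follows directly from the structure of parse: the request for page N+1 is only yielded after page N has been fully parsed, so the scheduler never holds more than one pending page request. A stripped-down sketch of this chaining pattern (the URL and CSS selectors below are placeholders, not taken from the real site):

import scrapy
import urllib.parse as urlparse
from scrapy import Request


class ChainedPaginationSpider(scrapy.Spider):
    # Sketch only: pages are crawled strictly one after another because
    # the request for page N+1 is created while page N is being parsed.
    name = 'chained-pagination'
    start_urls = ['https://example.com/listing?page=1']  # placeholder URL

    def parse(self, response):
        # First, every item on the current page is yielded ...
        for row in response.css('.item'):  # placeholder selector
            yield {'title': row.css('::text').extract_first()}

        # ... only then is the single follow-up request scheduled, so
        # there is never more than one page request in flight.
        next_href = response.css('.next > a::attr(href)').extract_first()
        if next_href:
            yield Request(urlparse.urljoin(response.url, next_href),
                          callback=self.parse)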

Spider 2: the second spider is structured slightly differently. It first extracts the URLs it needs from the start URL and then starts scraping those extracted URLs. I expected this spider to act just like the first one: scrape all restaurants from the first URL, then all restaurants from the second URL, and so on until no URLs are left. Instead, the second spider scrapes restaurants from all the extracted URLs at the same time: it yields one restaurant from one extracted URL, then another restaurant from a different extracted URL, until no restaurants are left.
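
If the second spider is meant to mimic the first one, the same chaining trick could be applied to it: yield only the first extracted URL, carry the remaining ones along in the request's meta dict, and schedule the next URL only once the current page is done. A minimal sketch under that assumption (item fields abbreviated; this is one possible restructuring, not the only one):

import scrapy
import urllib.parse as urlparse
from scrapy import Request


class SequentialAreaSpider(scrapy.Spider):
    # Sketch only: the extracted URLs are visited one at a time by
    # chaining the requests, mirroring the first spider's pattern.
    name = 'sequential-areas'
    start_urls = ['https://www.thuisbezorgd.nl/eten-bestellen-castricum']

    def parse(self, response):
        urls = response.css('.delarea::attr(href)').extract()
        if urls:
            # Yield only the first request; the remaining URLs travel
            # along in meta and are scheduled one by one later.
            yield Request(urlparse.urljoin(response.url, urls[0]),
                          callback=self.parse_item,
                          meta={'pending': urls[1:]},
                          dont_filter=True)

    def parse_item(self, response):
        for restaurant in response.css('.restaurant.grid'):
            yield {
                'name': restaurant.css(
                    '.restaurantname[itemprop]::text').extract_first(),
                # ... other fields as in the original spider ...
            }

        # Only once this page is done, request the next pending URL.
        pending = response.meta.get('pending', [])
        if pending:
            yield Request(urlparse.urljoin(response.url, pending[0]),
                          callback=self.parse_item,
                          meta={'pending': pending[1:]},
                          dont_filter=True)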

Question: why do these two spiders behave so differently? (The spiders have the same settings!)
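
For reference, both scripts create CrawlerProcess() with no arguments, so both crawls run with Scrapy's default settings, under which up to CONCURRENT_REQUESTS = 16 requests may be fetched in parallel; that is enough to allow the interleaving described above. A small, hypothetical spider for double-checking the effective values:

import scrapy


class SettingsCheckSpider(scrapy.Spider):
    # Hypothetical helper: logs the effective concurrency settings at
    # startup, to verify what a crawl is actually running with
    # (Scrapy's defaults are CONCURRENT_REQUESTS = 16 and
    # CONCURRENT_REQUESTS_PER_DOMAIN = 8).
    name = 'settings-check'
    start_urls = ['https://example.com/']  # placeholder URL

    def start_requests(self):
        self.logger.info('CONCURRENT_REQUESTS = %s',
                         self.settings.getint('CONCURRENT_REQUESTS'))
        self.logger.info('CONCURRENT_REQUESTS_PER_DOMAIN = %s',
                         self.settings.getint('CONCURRENT_REQUESTS_PER_DOMAIN'))
        return super().start_requests()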

0 Answers:

No answers.