Scrapy returns more results than expected

Time: 2016-07-11 19:51:33

Tags: python json web-scraping scrapy web-crawler

This is a continuation of the question: Extract from dynamic JSON response with Scrapy

I have a Scrapy spider that extracts values from a JSON response. It works well and extracts the right values, but somehow it enters a loop and returns more results than expected (duplicate results).

For example, for 17 values provided in the test.txt file it returns 289 results, that is, 17 times more than expected.

Spider content below:

import scrapy
import json
from whois.items import WhoisItem

class whoislistSpider(scrapy.Spider):
    name = "whois_list"
    start_urls = []
    f = open('test.txt', 'r')
    global lines
    lines = f.read().splitlines()
    f.close()
    def __init__(self):
        for line in lines:
            self.start_urls.append('http://www.example.com/api/domain/check/%s/com' % line)

    def parse(self, response):
        for line in lines:
            jsonresponse = json.loads(response.body_as_unicode())
            item = WhoisItem()
            domain_name = list(jsonresponse['domains'].keys())[0]
            item["avail"] = jsonresponse["domains"][domain_name]["avail"]
            item["domain"] = domain_name
            yield item

items.py content below

import scrapy

class WhoisItem(scrapy.Item):
    avail = scrapy.Field()
    domain = scrapy.Field()

pipelines.py below

class WhoisPipeline(object):
    def process_item(self, item, spider):
        return item
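As a side note, a pipeline can also guard against duplicate items. Below is a hedged, self-contained sketch of a deduplicating pipeline (the `DedupPipeline` name and the stand-in `DropItem` exception are illustrative, not part of the original code; in a real Scrapy project you would raise `scrapy.exceptions.DropItem`):

```python
# Sketch of a deduplicating pipeline: drop any item whose "domain"
# has already been seen. DropItem below is a stand-in for
# scrapy.exceptions.DropItem so this snippet runs without Scrapy.
class DropItem(Exception):
    """Stand-in for scrapy.exceptions.DropItem."""

class DedupPipeline:
    def __init__(self):
        self.seen = set()  # domains emitted so far

    def process_item(self, item, spider):
        if item["domain"] in self.seen:
            raise DropItem("duplicate domain: %s" % item["domain"])
        self.seen.add(item["domain"])
        return item

pipeline = DedupPipeline()
items = [{"domain": "a.com"}, {"domain": "a.com"}, {"domain": "b.com"}]
kept = []
for it in items:
    try:
        kept.append(pipeline.process_item(it, spider=None))
    except DropItem:
        pass  # duplicate dropped
# kept now holds only the two unique domains
```

Note that this only masks the symptom here; the real fix is in the spider's `parse` method, as the answer below shows.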

Thank you in advance for all the replies.

1 Answer:

Answer 0: (score: 1)

The parse function should be like this:

def parse(self, response):
    jsonresponse = json.loads(response.body_as_unicode())
    item = WhoisItem()
    domain_name = list(jsonresponse['domains'].keys())[0]
    item["avail"] = jsonresponse["domains"][domain_name]["avail"]
    item["domain"] = domain_name
    yield item

Notice that I removed the for loop.

What was happening: Scrapy calls parse once for every response, and the inner loop then parsed each response 17 times, therefore resulting in 17*17 = 289 records.
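The multiplication effect can be shown without Scrapy at all. This is a minimal sketch (the names `buggy_parse` and `fixed_parse` are illustrative): the framework invokes the callback once per response, so a loop over all input lines inside the callback repeats each item `len(lines)` times.

```python
# Scrapy calls parse() once per response. Looping over all input
# lines inside parse() therefore yields each item len(lines) times.
lines = ["a", "b", "c"]  # stands in for the 17 domains in test.txt

def buggy_parse(response):
    for _ in lines:       # inner loop runs once per input line...
        yield response    # ...so each response is yielded len(lines) times

def fixed_parse(response):
    yield response        # one item per response

responses = lines         # one (fake) response per start URL
buggy = [item for r in responses for item in buggy_parse(r)]
fixed = [item for r in responses for item in fixed_parse(r)]
print(len(buggy), len(fixed))  # prints "9 3"
```

With 3 inputs the buggy version yields 3*3 = 9 items; with the 17 inputs from the question it yields 17*17 = 289.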