Crawling links from a JSON file

Posted: 2016-09-20 01:33:37

Tags: python scrapy web-crawler

So, I'm new to web crawlers and I'm having some difficulty crawling a simple JSON file and retrieving the links from it. I'm using the Scrapy framework to try to achieve this.

My sample JSON file:

{
  "pages": [
    {
      "address": "http://foo.bar.com/p1",
      "links": ["http://foo.bar.com/p2", "http://foo.bar.com/p3", "http://foo.bar.com/p4"]
    },
    {
      "address": "http://foo.bar.com/p2",
      "links": ["http://foo.bar.com/p2", "http://foo.bar.com/p4"]
    },
    {
      "address": "http://foo.bar.com/p4",
      "links": ["http://foo.bar.com/p5", "http://foo.bar.com/p1", "http://foo.bar.com/p6"]
    },
    {
      "address": "http://foo.bar.com/p5",
      "links": []
    },
    {
      "address": "http://foo.bar.com/p6",
      "links": ["http://foo.bar.com/p7", "http://foo.bar.com/p4", "http://foo.bar.com/p5"]
    }
  ]
}

My items.py file:

import scrapy
from scrapy.item import Item, Field


class FoobarItem(Item):
    # define the fields for your item here like:
    title = Field()
    link = Field()

My spider file:

from scrapy.spider import Spider
from scrapy.selector import Selector
from foobar.items import FoobarItem

class MySpider(Spider):
    name = "foo"
    allowed_domains = ["localhost"]
    start_urls = ["http://localhost/testdata.json"]

    def parse(self, response):
        yield response.url

Ultimately I want to crawl the file and return the links from the objects without duplicates, but right now I'm struggling just to crawl the JSON. I thought the code above would crawl through the JSON object and return the links, but my output file is empty. I'm not sure what I'm doing wrong, but any help would be appreciated.

1 answer:

Answer 0 (score: -1)

First, you need a way to parse the JSON file; the json lib should handle that fine. The next step is to run your crawler with the URLs.

import json

with open("myExample.json", 'r') as infile:
    contents = json.load(infile)

# contents is now a dictionary built from your JSON; its "pages" key
# holds the list of page objects. Iterate through each one and fetch
# the pieces you need.

links_list = []
for item in contents["pages"]:
    for key, value in item.items():
        if isinstance(value, list):
            # the "links" field is a list of URLs
            for link in value:
                if 'http' in link:
                    links_list.append(link)
        elif 'http' in value:
            # the "address" field is a single URL
            links_list.append(value)

# get rid of dupes
links_list = list(set(links_list))
# do the rest of your crawling with the list of links
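For the second step, actually running the crawler over those links, here is a minimal sketch. It assumes a standard Scrapy project, the FoobarItem defined in the question, and that the JSON file is readable from disk when the spider starts; the spider name and file path are illustrative, not from the question.

import json

import scrapy

from foobar.items import FoobarItem


class LinkSpider(scrapy.Spider):
    name = "foo_links"  # illustrative name
    allowed_domains = ["foo.bar.com"]

    def start_requests(self):
        # build the de-duplicated link list from the JSON file,
        # then schedule one request per unique URL
        with open("myExample.json") as infile:
            pages = json.load(infile)["pages"]
        links = {link for page in pages for link in page.get("links", [])}
        for url in links:
            yield scrapy.Request(url, callback=self.parse)

    def parse(self, response):
        # yield items rather than bare strings so the feed exporter
        # has something it can serialize to the output file
        item = FoobarItem()
        item["title"] = response.xpath("//title/text()").extract_first()
        item["link"] = response.url
        yield item

Running it with something like scrapy crawl foo_links -o links.json should then write one item per fetched page instead of an empty output file.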