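Unable to extract an item from strange JSON content

Date: 2019-12-21 06:41:52

Tags: python json python-3.x web-scraping

I'm trying to grab some content from a JSON payload. However, the structure of that JSON is unfamiliar to me, and as a result I can't get the value of property out of it.

This is what I've tried so far: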

import json
import requests
from bs4 import BeautifulSoup

link = 'https://www.zillow.com/homedetails/5958-SW-4th-St-Miami-FL-33144/43835884_zpid/'

def fetch_content(link):
    content = requests.get(link,headers={"User-Agent":"Mozilla/5.0"})
    soup = BeautifulSoup(content.text,"lxml")
    item = soup.select_one("script#hdpApolloPreloadedData").text
    print(json.loads(item)['apiCache'])

if __name__ == '__main__':
    fetch_content(link)

Running the script above gives me:

{"VariantQuery{\"zpid\":43835884}":{"property":{"zpid":43835884,"streetAddress":"5958 SW 4th St",

I can't get any further because of that strange key at the front.

Expected output:

{"zpid":43835884,"streetAddress":"5958 SW 4th St", ----

How can I get the value of that property?

2 answers:

Answer 0 (score: 2)

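The JSON is malformed: the value of apiCache is itself a JSON-encoded string, so it has to be passed through json.loads a second time. After that, you can always get the zpid and the address like this (here item is assumed to be the script tag itself, i.e. the result of soup.select_one("script#hdpApolloPreloadedData") without .text):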

json.loads(json.loads(item.text)['apiCache'])['VariantQuery{"zpid":43835884}']['property']['zpid']
Out[1889]: 43835884

json.loads(json.loads(item.text)['apiCache'])['VariantQuery{"zpid":43835884}']['property']['streetAddress']
Out[1890]: '5958 SW 4th St'
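
To see why two json.loads calls are needed, here is a minimal offline sketch. The sample payload is a trimmed, hand-built stand-in for the script contents (only the two fields shown in the question), not the real page data, so it runs without any network request.

```python
import json

# Hand-built stand-in for the text of <script id="hdpApolloPreloadedData">:
# the "apiCache" value is a JSON document stored as a string.
inner = {
    'VariantQuery{"zpid":43835884}': {
        "property": {"zpid": 43835884, "streetAddress": "5958 SW 4th St"}
    }
}
script_text = json.dumps({"apiCache": json.dumps(inner)})

# First decode: the outer object is a dict, but apiCache is still a str.
outer = json.loads(script_text)
assert isinstance(outer["apiCache"], str)

# Second decode: now apiCache is a dict and can be indexed by the odd key.
api_cache = json.loads(outer["apiCache"])
prop = api_cache['VariantQuery{"zpid":43835884}']["property"]
print(prop["zpid"], prop["streetAddress"])
```

The key only looks strange because it embeds the query parameters as a JSON fragment; once the inner string is decoded, it is an ordinary dict key.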

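Answer 1 (score: 1)

Just modify your function as shown below. I have also added another function, process_fetched_content(), to give you more flexibility. You can simply run it, and it also handles the case where several keys start with 'VariantQuery{"zpid":'. The final output is a dict whose keys are the zpids and whose values are the property values you are looking for.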

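If you have many zpid values, this lets you accumulate them all first and process them afterwards. A nice side effect is that the list of keys is exactly the list of zpids you have.

Here is how to use this code.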

results = process_fetched_content(raw_dictionary = fetch_content(link, verbose=False))
print(results)

Output

{'43835884': {'zpid': 43835884, 'streetAddress': '5958 SW 4th St', 'zipcode': '33144', 'city': 'Miami', 'state': 'FL', 'latitude': 25.76661, 'longitude': -80.292801, 'price': 340000, 'dateSold': 1576875600000, 'bathrooms': 2, 'bedrooms': 3, 'livingArea': 1757, 'yearBuilt': 1973, 'lotSize': 4331, 'homeType': 'SINGLE_FAMILY', 'homeStatus': 'RECENTLY_SOLD', 'photoCount': 19, 'imageLink': 'https://photos.zillowstatic.com/p_g/IS7yxihwtuqmlq1000000000.jpg', 'daysOnZillow': 0, 'isFeatured': False, 'shouldHighlight': False, 'brokerId': 0, 'zestimate': 341336, 'rentZestimate': 2200, 'listing_sub_type': {}, 'priceReduction': '', 'isUnmappable': False, 'rentalPetsFlags': 128, 'mediumImageLink': 'https://photos.zillowstatic.com/p_c/IS7yxihwtuqmlq1000000000.jpg', 'isPreforeclosureAuction': False, 'homeStatusForHDP': 'RECENTLY_SOLD', 'priceForHDP': 340000, 'festimate': 341336, 'isListingOwnedByCurrentSignedInAgent': False, 'isListingClaimedByCurrentSignedInUser': False, 'hiResImageLink': 'https://photos.zillowstatic.com/p_f/IS7yxihwtuqmlq1000000000.jpg', 'watchImageLink': 'https://photos.zillowstatic.com/p_j/IS7yxihwtuqmlq1000000000.jpg', 'tvImageLink': 'https://photos.zillowstatic.com/p_m/IS7yxihwtuqmlq1000000000.jpg', 'tvCollectionImageLink': 'https://photos.zillowstatic.com/p_l/IS7yxihwtuqmlq1000000000.jpg', 'tvHighResImageLink': 'https://photos.zillowstatic.com/p_n/IS7yxihwtuqmlq1000000000.jpg', 'zillowHasRightsToImages': True, 'desktopWebHdpImageLink': 'https://photos.zillowstatic.com/p_h/IS7yxihwtuqmlq1000000000.jpg', 'isNonOwnerOccupied': False, 'hideZestimate': False, 'isPremierBuilder': False, 'isZillowOwned': False, 'currency': 'USD', 'country': 'USA', 'taxAssessedValue': 224131, 'streetAddressOnly': '5958 SW 4th St', 'unit': ' '}}

Code

import json
import requests
from bs4 import BeautifulSoup

link = 'https://www.zillow.com/homedetails/5958-SW-4th-St-Miami-FL-33144/43835884_zpid/'

def fetch_content(link, verbose=False):
    content = requests.get(link, headers={"User-Agent": "Mozilla/5.0"})
    soup = BeautifulSoup(content.text, "lxml")
    item = soup.select_one("script#hdpApolloPreloadedData").text
    # First decode: the outer object, whose 'apiCache' value is still a string
    d = json.loads(item)['apiCache']
    # Second decode: turn that string into a dict
    d = json.loads(d)
    if verbose:
        print(d)
    return d

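def process_fetched_content(raw_dictionary=None):
    if raw_dictionary is None:
        return None
    # Keep only keys such as 'VariantQuery{"zpid":43835884}'
    keys = [k for k in raw_dictionary if k.startswith('VariantQuery{"zpid":')]
    # Map each zpid (the text after the last ':', minus the closing '}')
    # to its 'property' dict; look it up in raw_dictionary, not an undefined d
    return {k.split(':')[-1].replace('}', ''): raw_dictionary[k].get('property')
            for k in keys}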
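
As an alternative to splitting the key on ':' and '}', the zpid can be recovered by parsing the part of the key after VariantQuery, since that part is itself a small JSON object. The sketch below assumes every relevant key has the form VariantQuery{...}; the helper name properties_by_zpid and the sample dict are hypothetical and only for illustration.

```python
import json

PREFIX = "VariantQuery"

def properties_by_zpid(api_cache):
    """Map each zpid (as an int) to its 'property' dict."""
    results = {}
    for key, value in api_cache.items():
        if not key.startswith(PREFIX + "{"):
            continue  # not a VariantQuery entry
        try:
            # The text after the prefix is a JSON object like {"zpid":43835884}
            params = json.loads(key[len(PREFIX):])
        except json.JSONDecodeError:
            continue  # key does not follow the assumed format
        if "zpid" in params and isinstance(value, dict):
            results[params["zpid"]] = value.get("property")
    return results

# Hand-built sample standing in for the decoded apiCache dict
sample = {
    'VariantQuery{"zpid":43835884}': {
        "property": {"zpid": 43835884, "streetAddress": "5958 SW 4th St"}
    },
    "SomeOtherQuery": {"ignored": True},
}
found = properties_by_zpid(sample)
print(found)
```

Note that the keys of the result are integers here, whereas process_fetched_content() above returns them as strings; pick whichever suits the later processing.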