Beautiful Soup: extracting information

Date: 2019-09-18 18:28:49

Tags: beautifulsoup

I'm having some trouble extracting information from this HTML excerpt.

So far I am using the following code to pull out the HTML shown below.

#//////////////////////////////
from bs4 import BeautifulSoup

with open('soup.html', 'r') as f:
    soup = BeautifulSoup(f, 'html.parser')

# Collect every JSON-LD <script> block in the page.
base = soup.find_all('script', type="application/ld+json")

print(base)
#//////////////////////////////
  1. How do I extract the URL for each row?
  2. How do I extract the name for each row?

This is what I get:

[<script type="application/ld+json">
      {"@context":"http://schema.org","@type":"Organization","name":"Redfin","logo":"https://ssl.cdn-redfin.com/static-images/images/redfin-logo-transparent-bg-260x66.png","url":"https://www.redfin.com"}
</script>,
<script type="application/ld+json">
    [{"@context":"http://schema.org","name":"7316 Green St, New Orleans, LA 70118","url":"/LA/New-Orleans/7316-Green-St-70118/home/79443425","address":{"@type":"PostalAddress","streetAddress":"7316 Green St","addressLocality":"New Orleans","addressRegion":"LA","postalCode":"70118","addressCountry":"US"},"numberOfRooms":"6","@type":"SingleFamilyResidence"},{"@context":"http://schema.org","@type":"Product","name":"7316 Green St, New Orleans, LA 70118","offers":{"@type":"Offer","price":"624900","priceCurrency":"USD"},"url":"/LA/New-Orleans/7316-Green-St-70118/home/79443425"}]
</script>,
<script type="application/ld+json">
    [{"@context":"http://schema.org","name":"257 Cherokee St #2, New Orleans, LA 70118","url":"/LA/New-Orleans/257-Cherokee-St-70118/unit-2/home/144766248","address":{"@type":"PostalAddress","streetAddress":"257 Cherokee St #2","addressLocality":"New Orleans","addressRegion":"LA","postalCode":"70118","addressCountry":"US"},"numberOfRooms":"2","@type":"SingleFamilyResidence"},{"@context":"http://schema.org","@type":"Product","name":"257 Cherokee St #2, New Orleans, LA 70118","offers":{"@type":"Offer","price":"129500","priceCurrency":"USD"},"url":"/LA/New-Orleans/257-Cherokee-St-70118/unit-2/home/144766248"}]
</script>, <script type="application/ld+json">

2 answers:

Answer 0: (score: 0)

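The result shown is a list of script tags whose contents are JSON (a dictionary, or a list of dictionaries). You should iterate over it and pick out the values you need.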
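As a rough sketch of that iteration, assuming the text of each script tag has already been pulled out as a string: the helper name `iter_items` and the shortened sample payloads are made up for illustration and are not part of the original question. It only uses the standard `json` module.

```python
import json


def iter_items(json_texts):
    """Yield every JSON object found in the given ld+json payloads.

    A payload may be a single object or a list of objects, so both
    shapes are flattened into one stream of dictionaries.
    """
    for text in json_texts:
        data = json.loads(text)
        if isinstance(data, dict):
            data = [data]
        elif not isinstance(data, list):
            continue  # skip payloads that are neither an object nor a list
        for item in data:
            if isinstance(item, dict):
                yield item


# Shortened payloads in the same two shapes as the output in the question.
payloads = [
    '{"@type": "Organization", "name": "Redfin", "url": "https://www.redfin.com"}',
    '[{"@type": "SingleFamilyResidence",'
    ' "name": "7316 Green St, New Orleans, LA 70118",'
    ' "url": "/LA/New-Orleans/7316-Green-St-70118/home/79443425"}]',
]

for item in iter_items(payloads):
    # .get() returns None instead of raising if a block lacks a key.
    print(item.get("name"), item.get("url"))
```

With the question's code, the input could be something like `(tag.string for tag in base)`.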

Answer 1: (score: 0)

Use json to parse the contents into dictionaries/lists; you can then access each item by its key name.

You will need to add:

import json

Then you can do:

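#//////////////////////////////
import json
from bs4 import BeautifulSoup

with open('soup.html', 'r') as f:
    soup = BeautifulSoup(f, 'html.parser')

base = soup.find_all('script', type="application/ld+json")

for each in base:
    jsonData = json.loads(each.string)
    # Some blocks hold a single object, others a list of objects.
    items = jsonData if isinstance(jsonData, list) else [jsonData]
    for item in items:
        name = item['name']
        url = item['url']
        print('Name: %s\nURL: %s\n' % (name, url))
#//////////////////////////////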
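In the output from the question, each listing appears twice (once as `SingleFamilyResidence`, once as `Product`) and the listing URLs are relative. If that matters, something along these lines might help. It is only a sketch: `unique_listings` is a hypothetical helper, and using `https://www.redfin.com` (the `url` of the Organization block) as the base for the relative paths is an assumption.

```python
from urllib.parse import urljoin

# Assumption: relative listing URLs resolve against the Organization URL.
BASE_URL = "https://www.redfin.com"


def unique_listings(items, base_url=BASE_URL):
    """Return (name, absolute_url) pairs, one per distinct URL, in first-seen order."""
    seen = set()
    result = []
    for item in items:
        name, url = item.get("name"), item.get("url")
        if not name or not url:
            continue  # skip entries missing either field
        absolute = urljoin(base_url, url)
        if absolute in seen:
            continue  # same listing already recorded under another @type
        seen.add(absolute)
        result.append((name, absolute))
    return result


# Two entries for the same listing, shortened from the question's output.
items = [
    {"@type": "SingleFamilyResidence",
     "name": "7316 Green St, New Orleans, LA 70118",
     "url": "/LA/New-Orleans/7316-Green-St-70118/home/79443425"},
    {"@type": "Product",
     "name": "7316 Green St, New Orleans, LA 70118",
     "url": "/LA/New-Orleans/7316-Green-St-70118/home/79443425"},
]

for name, url in unique_listings(items):
    print('Name: %s\nURL: %s\n' % (name, url))
```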