Python 3 BeautifulSoup: scraping content hidden behind "Read More" text

Date: 2021-06-26 02:26:39

Tags: python python-3.x web-scraping beautifulsoup

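I recently started thinking about buying some land, and I'm writing a small app to help me organize the details in Jira/Confluence, so that for each parcel I can keep track of who I've talked to and what we discussed.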

So I wrote this little scraper for landwatch(dot)com:

[url is just a listing on the site]

from bs4 import BeautifulSoup
import requests


def get_property_data(url):
    headers = ({'User-Agent':
                    'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36'})
    response = requests.get(url, headers=headers)  # Maybe request Url with read more already gone
    soup = BeautifulSoup(response.text, 'html5lib')
    title = soup.find_all(class_='b442a')[0].text
    details = soup.find_all('p', class_='d19de')
    price = soup.find_all('div', class_='_260f0')[0].text
    deets = []
    for i in range(len(details)):
        if details[i].text != '':
            deets.append(details[i].text)
    detail = ''
    for i in deets:
        detail += '<p>' + i + '</p>'
    return [title, detail, price]

Everything works well, except that the d19de class has a large number of values hidden behind the Read More button.

While searching on Google I found "How to Scrape reviews with read more from Webpages using BeautifulSoup", but either I don't understand what they are doing well enough to implement it, or it no longer works:

import requests ; from bs4 import BeautifulSoup

soup = BeautifulSoup(requests.get("http://www.mouthshut.com/product-reviews/Lakeside-Chalet-Mumbai-reviews-925017044").text, "html.parser")
for title in soup.select("a[id^=ctl00_ctl00_ContentPlaceHolderFooter_ContentPlaceHolderBody_rptreviews_]"):
    items = title.get('href')
    if items:
        broth = BeautifulSoup(requests.get(items).text, "html.parser")
        for item in broth.select("div.user-review p.lnhgt"):
            print(item.text)

Any ideas on how to get past the Read More button? I'd really like to do this in BeautifulSoup rather than selenium.

Here is a sample URL for testing: https://www.landwatch.com/huerfano-county-colorado-recreational-property-for-sale/pid/410454403

1 Answer:

Answer 0 (score: 1)

The data is present in a script tag. Here is an example that extracts that content, parses it with json, and outputs the land description as a list:

from bs4 import BeautifulSoup
import requests, json

url = 'https://www.landwatch.com/huerfano-county-colorado-recreational-property-for-sale/pid/410454403'
headers = ({'User-Agent':
                    'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36'})
response = requests.get(url, headers=headers)  # Maybe request Url with read more already gone
soup = BeautifulSoup(response.text, 'html5lib')

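# The listing data is embedded in the page as JSON-LD inside a <script> tag
all_data = json.loads(soup.select_one('[type="application/ld+json"]').string)
# Paragraphs in the description are separated by two carriage returns
details = all_data['description'].split('\r\r')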
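
Splitting on '\r\r' depends on how the site happens to encode its line breaks. As a rough sketch (not part of the original answer), the description could be normalized first and then turned into the same '<p>...</p>' string that get_property_data in the question builds. The helper name description_to_html and the sample text are made up for illustration, and escaping the text with html.escape is an added assumption that the question's code does not make:

```python
import re
from html import escape


def description_to_html(description):
    """Split a description into paragraphs and wrap each one in <p> tags."""
    # Treat \r\n and bare \r as \n so one rule covers every line-ending style.
    normalized = description.replace('\r\n', '\n').replace('\r', '\n')
    # Two or more line breaks (possibly with whitespace between) end a paragraph.
    paragraphs = [p.strip() for p in re.split(r'\n\s*\n', normalized)]
    # Drop empty pieces, escape HTML-special characters, and wrap in <p>.
    return ''.join('<p>' + escape(p) + '</p>' for p in paragraphs if p)


# Hypothetical sample text; the real value would be all_data['description'].
sample = 'Great views of the peaks.\r\rWell & septic on site.\r\r'
print(description_to_html(sample))
```

If the description should be inserted verbatim instead, dropping the escape call gives the same behavior as the loop in the question.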

You may want to inspect the other content in that script tag:

from pprint import pprint

pprint(all_data)
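
A page can contain more than one application/ld+json block, and a block may hold a list rather than a single object, in which case select_one followed by all_data['description'] would fail. The following is a minimal sketch of a more defensive lookup, assuming the listing object is the first one that carries a 'description' key. The function name find_json_ld and the sample HTML are invented for illustration and are not taken from the actual site:

```python
import json
from bs4 import BeautifulSoup


def find_json_ld(html, key='description'):
    """Return the first JSON-LD object in `html` that has `key`, or None."""
    soup = BeautifulSoup(html, 'html.parser')
    for script in soup.find_all('script', type='application/ld+json'):
        try:
            data = json.loads(script.string or '')
        except json.JSONDecodeError:
            continue  # skip blocks that are empty or not valid JSON
        # A JSON-LD block may hold a single object or a list of objects.
        candidates = data if isinstance(data, list) else [data]
        for candidate in candidates:
            if isinstance(candidate, dict) and key in candidate:
                return candidate
    return None


# Made-up page for illustration; the real input would be response.text.
sample_html = '''
<html><head>
<script type="application/ld+json">{"@type": "BreadcrumbList"}</script>
<script type="application/ld+json">
{"@type": "Product", "name": "40 acres", "description": "First.\\r\\rSecond."}
</script>
</head><body><p class="d19de">First.</p></body></html>
'''

listing = find_json_ld(sample_html)
print(listing['name'])
print(listing['description'].split('\r\r'))
```

Because the function returns None when nothing matches, the caller can fall back to the visible d19de paragraphs if the script tag ever disappears from the page.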