How to extract course name / school / description from a website (e.g. Udacity)

Time: 2021-08-02 00:35:29

Tags: python beautifulsoup

import requests
from bs4 import BeautifulSoup

r = requests.get("https://www.udacity.com/courses/all")
soup = BeautifulSoup(r.text, "html.parser")  # name the parser explicitly

# Using class_="card-list_catalogCardListItem__aUQtx" returned 0 results
summaries = soup.find_all("li", class_="")
print('Number of Courses:', len(summaries))  # prints 225

summaries[7].select_one("li").get_text().strip()  # output: 'AI for Business Leaders'
summaries[7].select_one("a").get_text().strip()   # output: 'Artificial Intelligence'

courses = []
for summary in summaries:
    title = summary.select_one("a").get_text().strip()
    school = summary.select_one("li").get_text().strip()
    courses.append((title, school))
# Extracting the text of every summary in this loop raises:
# "AttributeError: 'NoneType' object has no attribute 'get_text'"


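For educational purposes, I want to extract:

1) all Udacity courses 2) which school each course belongs to 3) a short description

I tried to do this with find_all, using the code above. My manual search suggests there are 264 courses on the page. I first used find_all("li", class_="card-list_catalogCardListItem__aUQtx"), which returned 0 results. Leaving class_ empty, just as a test, gave 225, the closest count I could get. However, when I use a for loop to extract all the courses, it ends in an AttributeError: 'NoneType' object has no attribute 'get_text'. This is probably because not every summary that was found is readable.

My question: how can I do this? (find_all does not seem to find the right tags.)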

1 Answer:

Answer 0: (score: 1)

The page is loaded dynamically by sending a GET request to:

https://www.udacity.com/data/catalog.json?v=%223cd8649e%22
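The `v` query parameter in that URL is percent-encoded. A minimal sketch using only the standard library shows what it decodes to. What the value means is not stated here; it looks like a version or cache token, which is an assumption, so it may differ when you inspect the network requests yourself.

```python
from urllib.parse import urlsplit, parse_qs

url = "https://www.udacity.com/data/catalog.json?v=%223cd8649e%22"

parts = urlsplit(url)
query = parse_qs(parts.query)  # parse_qs also decodes percent-escapes

print(parts.path)       # path of the JSON resource
print(query["v"][0])    # decoded value of the v parameter, including the quotes
```

Here `%22` is the percent-encoding of a double quote, so the decoded value is the string 3cd8649e wrapped in double quotes.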

You can send a request to that link to receive all the data, and then access its keys and values as a Python dictionary (dict):

import requests

url = "https://www.udacity.com/data/catalog.json?v=%223cd8649e%22"
response = requests.get(url).json()  # parse the JSON body into Python objects

for data in response:
    course = data["payload"]
    # Only print entries that have a short summary
    if "shortSummary" in course:
        print("{:<50} {:<60} {:<50}".format(course["school"], course["title"], course["shortSummary"]))

Output (truncated):

School of Data Science                             Data Engineer                                                Data Engineering is the foundation for the new world of Big Data. Enroll now to build production-ready data infrastructure, an essential skill for advancing your data career.
School of Data Science                             Data Scientist                                               Build effective machine learning models, run data pipelines, build recommendation systems, and deploy solutions to the cloud with industry-aligned projects.
School of Data Science                             Data Analyst                                                 Use Python, SQL, and statistics to uncover insights, communicate critical findings, and create data-driven solutions.
School of Data Science                             Programming for Data Science with Python                     Learn the fundamental programming tools for data professionals: Python, SQL, the Terminal and Git.
School of Autonomous Systems                       C++                                                          Get hands-on experience by building five real-world projects.
School of Product Management                       Product Manager                                              Envision and execute the development of industry-defining products, and learn how to successfully bring them to market.

Using {:<50} {:<60} {:<50} left-aligns each field and pads it to the specified width.
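To collect the results into a list rather than printing them, as the question intended with `courses`, something like the following sketch may help. The `sample` data is made up for illustration and only mirrors the `payload` / `school` / `title` / `shortSummary` keys used above; `collect_courses` and `format_row` are hypothetical helper names, and the column widths are shortened to keep the example readable.

```python
# Hypothetical sample shaped like the entries used above; not real catalog output.
sample = [
    {"payload": {"school": "School of Data Science",
                 "title": "Data Analyst",
                 "shortSummary": "Use Python, SQL, and statistics."}},
    {"payload": {"school": "School of Autonomous Systems",
                 "title": "C++"}},  # no shortSummary, so it is skipped
]


def collect_courses(records):
    """Return (school, title, summary) tuples, skipping incomplete entries."""
    rows = []
    for record in records:
        course = record.get("payload", {})
        if all(key in course for key in ("school", "title", "shortSummary")):
            rows.append((course["school"], course["title"], course["shortSummary"]))
    return rows


def format_row(school, title, summary):
    # {:<30} pads on the right up to 30 characters; longer text is not truncated.
    return "{:<30} {:<20} {}".format(school, title, summary)


rows = collect_courses(sample)
for row in rows:
    print(format_row(*row))
```

Checking for all three keys before indexing avoids a KeyError on incomplete entries, in the same spirit as the `"shortSummary" in course` test above.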