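BeautifulSoup: finding multiple categories

Time: 2019-12-06 11:57:34

Tags: python beautifulsoup

I'm trying to scrape some Wiki pages, just for practice, and I'm stuck.

I want to print the page title, the last-modified date, and the categories. Here is my code: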

from bs4 import BeautifulSoup
import requests
import pandas as pd


response = requests.get('https://en.wikipedia.org/wiki/Eurovision_Song_Contest') 
soup = BeautifulSoup(response.content, "html.parser") 


head=soup.find(class_='firstHeading').get_text()
print('wikipedia entry: '+head)

foot=soup.find(id='footer-info-lastmod').get_text()
print(foot)

cate=soup.find_all(class_='mw-normal-catlinks')
x=soup.findAll("li",attrs={"title"})
print(x)

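But it says: ResultSet object has no attribute 'get_text'. You're probably treating a list of items like a single item. Did you call find_all() when you meant to call find()?

I need to print the list of categories, for example the ones shown on this page.

4 answers:

Answer 0 (score: 1)

This script prints the header, the footer, and the list of categories: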

from bs4 import BeautifulSoup
import requests

response = requests.get('https://en.wikipedia.org/wiki/Eurovision_Song_Contest')
soup = BeautifulSoup(response.content, "html.parser")

head=soup.find(class_='firstHeading').get_text()
print('wikipedia entry: {}'.format(head))      # better use str.format()

foot=soup.find(id='footer-info-lastmod').get_text(strip=True)   # use strip=True to strip the text of whitespace characters
print(foot)

categories = [li.get_text() for li in soup.select('#mw-normal-catlinks li')]
print(categories)

Prints:

wikipedia entry: Eurovision Song Contest
This page was last edited on 6 December 2019, at 10:20(UTC).
['Eurovision Song Contest', '1956 establishments in Europe', 'Eurovision events', 'Music television', 'Pop music festivals', 'Recurring events established in 1956', 'Song contests']
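
To try the selector without a network request, here is a minimal sketch that runs the same `select()` call against a hand-written HTML fragment. The fragment and its two category names are assumptions modeled loosely on the category block, not the actual Wikipedia markup.

```python
from bs4 import BeautifulSoup

# Hand-written fragment (an assumption, not the real Wikipedia markup).
HTML = """
<div id="mw-normal-catlinks" class="mw-normal-catlinks">
  <a href="/wiki/Help:Category">Categories</a>:
  <ul>
    <li><a href="/wiki/Category:Eurovision_events">Eurovision events</a></li>
    <li><a href="/wiki/Category:Song_contests">Song contests</a></li>
  </ul>
</div>
"""

soup = BeautifulSoup(HTML, "html.parser")

# '#mw-normal-catlinks li' matches every <li> nested under the element with that id.
categories = [li.get_text(strip=True) for li in soup.select("#mw-normal-catlinks li")]
print(categories)
```

Because the selector targets `li` elements only, the link that sits outside the `ul` is not picked up.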

Answer 1 (score: 1)

You can solve the problem by finding the parent div:

Code

from bs4 import BeautifulSoup
import requests

response = requests.get('https://en.wikipedia.org/wiki/Eurovision_Song_Contest')
soup = BeautifulSoup(response.content, "html.parser")

head = soup.find(class_='firstHeading').get_text()
print('wikipedia entry: ' + head)

foot = soup.find(id='footer-info-lastmod').get_text()
print(foot)

# Locate the parent div by id, then walk down to the <li> items inside its <ul>.
catdiv = soup.find("div", {"id": "mw-normal-catlinks"})
categories = catdiv.find("ul").find_all("li")
for cat in categories:
    print(cat.text)

Result:

wikipedia entry: Eurovision Song Contest
 This page was last edited on 6 December 2019, at 10:20 (UTC).
Eurovision Song Contest
1956 establishments in Europe
Eurovision events
Music television
Pop music festivals
Recurring events established in 1956
Song contests

Answer 2 (score: 1)

Even simpler:

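# Assumes `soup` was built as in the question.
normal = soup.find(class_="mw-normal-catlinks")
# Note: this matches every link inside the div, which may include a leading "Categories" help link.
categories = normal.find_all("a")
for category in categories:
    print(category.text)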
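
One thing to watch with collecting every `a` inside the container: if the container also holds a link that is not a category, it ends up in the result. The sketch below illustrates this on a hand-written fragment (an assumption about the markup, not a copy of the real page) and shows one possible way to restrict the search to links inside the `ul`.

```python
from bs4 import BeautifulSoup

# Hand-written fragment (an assumption): one help link plus two category links.
HTML = """
<div id="mw-normal-catlinks" class="mw-normal-catlinks">
  <a href="/wiki/Help:Category">Categories</a>:
  <ul>
    <li><a href="/wiki/Category:Eurovision_events">Eurovision events</a></li>
    <li><a href="/wiki/Category:Song_contests">Song contests</a></li>
  </ul>
</div>
"""

soup = BeautifulSoup(HTML, "html.parser")
normal = soup.find(class_="mw-normal-catlinks")

# Every link in the container, including the one outside the list.
all_links = [a.get_text() for a in normal.find_all("a")]

# Only the links nested inside the <ul>.
list_links = [a.get_text() for a in normal.select("ul a")]

print(all_links)
print(list_links)
```

Whether the extra link matters depends on the page; checking the printed result against the rendered category list is the quickest way to tell.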

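Answer 3 (score: 0)

Your script prints the 'head' and 'foot' perfectly, so I'll focus on printing the list of categories.

First, find_all() returns a list of tags (a ResultSet), not a single tag, so calling 'get_text()' on that list raises an error.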

cate=soup.find_all(class_='mw-normal-catlinks')
print(cate.get_text())

AttributeError: ResultSet object has no attribute 'get_text'. You're probably treating a list of items like a single item. Did you call find_all() when you meant to call find()?
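
The difference can be reproduced offline. The following is a minimal sketch on a one-line, hand-written HTML string (an assumption used only for illustration): `find_all()` gives back a list-like ResultSet, `find()` gives back a single tag, and only the tag has `get_text()`.

```python
from bs4 import BeautifulSoup

# Tiny hand-written document (an assumption, for illustration only).
HTML = '<div class="mw-normal-catlinks"><ul><li>Song contests</li></ul></div>'
soup = BeautifulSoup(HTML, "html.parser")

many = soup.find_all(class_="mw-normal-catlinks")  # list-like ResultSet of matches
one = soup.find(class_="mw-normal-catlinks")       # first match, or None if nothing matches

print(type(many).__name__, len(many))
print(type(one).__name__)

# Calling get_text() on the whole ResultSet fails...
try:
    many.get_text()
except AttributeError as exc:
    print("find_all() result:", type(exc).__name__)

# ...while calling it on a single tag, or on each element of the ResultSet, works.
print(one.get_text())
texts = [tag.get_text() for tag in many]
print(texts)
```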

In your case, since find_all() returns only one tag, you can either use 'find()' or extract the tag (the div) from the returned list.

cate=soup.find_all(class_='mw-normal-catlinks')[0]

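Your categories are under the 'ul' tag, which here is a child of the 'div' tag (the one you extracted with find_all()), so you can access them directly and store them in a list like this: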

cate=soup.find_all(class_='mw-normal-catlinks')[0]

x=cate.ul.get_text("|")

categoryList = x.split("|")

print(categoryList)

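Output: ['Eurovision Song Contest', '1956 establishments in Europe', 'Eurovision events', 'Music television', 'Pop music festivals', 'Recurring events established in 1956', 'Song contests']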
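
Joining on a separator and splitting again relies on the list items being the only text inside the 'ul'. As a hedged alternative, the sketch below reads each 'li' on its own. The HTML fragment is hand-written with line breaks between the items (an assumption) so the difference is visible; the name `split_based` is introduced here only for the comparison.

```python
from bs4 import BeautifulSoup

# Hand-written fragment (an assumption) with whitespace between the <li> items.
HTML = """
<div class="mw-normal-catlinks">
  <ul>
    <li><a>Eurovision events</a></li>
    <li><a>Song contests</a></li>
  </ul>
</div>
"""

soup = BeautifulSoup(HTML, "html.parser")
cate = soup.find(class_="mw-normal-catlinks")

# Separator-based approach: whitespace-only text nodes also become entries.
split_based = cate.ul.get_text("|").split("|")

# Per-item approach: one entry per <li>, with surrounding whitespace stripped.
category_list = [li.get_text(strip=True) for li in cate.ul.find_all("li")]

print(split_based)
print(category_list)
```

The per-item version also keeps working if a category name happens to contain the separator character.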
