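I'm trying to scrape some Wikipedia pages, just for practice, and I'm stuck.
I want to print the page title, the last-modified date and the categories. Here is my code: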
from bs4 import BeautifulSoup
import requests
import pandas as pd
response = requests.get('https://en.wikipedia.org/wiki/Eurovision_Song_Contest')
soup = BeautifulSoup(response.content, "html.parser")
head=soup.find(class_='firstHeading').get_text()
print('wikipedia entry: '+head)
foot=soup.find(id='footer-info-lastmod').get_text()
print(foot)
cate=soup.find_all(class_='mw-normal-catlinks')
x=soup.findAll("li",attrs={"title"})
print(x)
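But it says: ResultSet object has no attribute 'get_text'. You're probably treating a list of items like a single item. Did you call find_all() when you meant to call find()?
Answer 0 (score: 1)
This script prints the heading, the footer and the list of categories: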
from bs4 import BeautifulSoup
import requests
response = requests.get('https://en.wikipedia.org/wiki/Eurovision_Song_Contest')
soup = BeautifulSoup(response.content, "html.parser")
head=soup.find(class_='firstHeading').get_text()
print('wikipedia entry: {}'.format(head)) # better use str.format()
foot=soup.find(id='footer-info-lastmod').get_text(strip=True) # use strip=True to strip the text of whitespace characters
print(foot)
categories = [li.get_text() for li in soup.select('#mw-normal-catlinks li')]
print(categories)
Prints:
wikipedia entry: Eurovision Song Contest
This page was last edited on 6 December 2019, at 10:20(UTC).
['Eurovision Song Contest', '1956 establishments in Europe', 'Eurovision events', 'Music television', 'Pop music festivals', 'Recurring events established in 1956', 'Song contests']
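For anyone who wants to try the same selectors without hitting the network, here is a minimal sketch that runs against an inline HTML string. The `SAMPLE_HTML` snippet and the helper names `extract_categories` and `extract_last_modified` are made up for illustration; the snippet only imitates the structure the selectors above rely on, and the real page markup may differ. It also guards against `find()` returning `None` when an element is missing.

```python
from bs4 import BeautifulSoup

# Hypothetical, trimmed-down stand-in for a Wikipedia page (not the real markup).
SAMPLE_HTML = """
<h1 class="firstHeading">Eurovision Song Contest</h1>
<div id="mw-normal-catlinks" class="mw-normal-catlinks">
  <a href="/wiki/Help:Category">Categories</a>:
  <ul>
    <li><a href="/wiki/Category:Song_contests">Song contests</a></li>
    <li><a href="/wiki/Category:Music_television">Music television</a></li>
  </ul>
</div>
<ul>
  <li id="footer-info-lastmod"> This page was last edited on 6 December 2019, at 10:20 (UTC).</li>
</ul>
"""


def extract_categories(html):
    """Return the category names, or an empty list if the box is absent."""
    soup = BeautifulSoup(html, "html.parser")
    # select() returns an empty list when nothing matches, so no guard is needed.
    return [li.get_text(strip=True) for li in soup.select("#mw-normal-catlinks li")]


def extract_last_modified(html):
    """Return the footer text, or None if the footer element is absent."""
    soup = BeautifulSoup(html, "html.parser")
    node = soup.find(id="footer-info-lastmod")
    # find() returns None on no match; calling get_text() on None would raise.
    return node.get_text(strip=True) if node is not None else None


print(extract_categories(SAMPLE_HTML))
print(extract_last_modified(SAMPLE_HTML))
```

Wrapping the lookups in small functions is just one way to keep the `None` check in a single place; the one-liners in the answer work equally well when the elements are known to exist.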
Answer 1 (score: 1)
You can solve the problem by finding the parent div:
Code:
from bs4 import BeautifulSoup
import requests
import pandas as pd
response = requests.get('https://en.wikipedia.org/wiki/Eurovision_Song_Contest')
soup = BeautifulSoup(response.content, "html.parser")
head=soup.find(class_='firstHeading').get_text()
print('wikipedia entry: '+head)
foot=soup.find(id='footer-info-lastmod').get_text()
print(foot)
cate=soup.find_all(class_='mw-normal-catlinks')
catdiv = soup.find("div",{"id":"mw-normal-catlinks"})
categories = catdiv.find("ul").find_all("li")
for cat in categories:
    print(cat.text)
Result:
wikipedia entry: Eurovision Song Contest
This page was last edited on 6 December 2019, at 10:20 (UTC).
Eurovision Song Contest
1956 establishments in Europe
Eurovision events
Music television
Pop music festivals
Recurring events established in 1956
Song contests
Answer 2 (score: 1)
Simpler:
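normal = soup.find(class_="mw-normal-catlinks")
# Note: this collects every link in the box, which may include the leading "Categories" help link
categories = normal.find_all("a")
for category in categories:
    print(category.text)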
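One thing to watch with this approach: if the box contains a link outside the list (a "Categories" label link, for example), `find_all("a")` on the whole box picks it up too. The sketch below shows the difference on an inline snippet; `SAMPLE_HTML` is a hypothetical stand-in, not the real page markup.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the category box, with a label link before the list.
SAMPLE_HTML = """
<div id="mw-normal-catlinks" class="mw-normal-catlinks">
  <a href="/wiki/Help:Category">Categories</a>:
  <ul>
    <li><a href="/wiki/Category:Song_contests">Song contests</a></li>
    <li><a href="/wiki/Category:Music_television">Music television</a></li>
  </ul>
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
box = soup.find(class_="mw-normal-catlinks")

# Every <a> inside the box, including the label link.
all_links = [a.get_text() for a in box.find_all("a")]

# Only the <a> tags inside the <ul>, i.e. the actual categories.
list_links = [a.get_text() for a in box.ul.find_all("a")]

print(all_links)   # ['Categories', 'Song contests', 'Music television']
print(list_links)  # ['Song contests', 'Music television']
```

Scoping the search to `box.ul` keeps the loop just as short while skipping anything that is not a list entry.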
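Answer 3 (score: 0)
Your script prints the head and the foot perfectly, so I'll focus on printing the list of categories.
First, find_all() returns a list of tags (a ResultSet), not a single tag, so calling get_text() on that list raises an error.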
cate=soup.find_all(class_='mw-normal-catlinks')
print(cate.get_text())
AttributeError: ResultSet object has no attribute 'get_text'. You're probably treating a list of items like a single item. Did you call find_all() when you meant to call find()?
In your case, since find_all() returns only one tag, you can either use find() or extract the tag (the div) from the returned list.
cate=soup.find_all(class_='mw-normal-catlinks')[0]
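To make the difference concrete, here is a small offline sketch comparing the two calls. The `SAMPLE_HTML` string is a hypothetical stand-in for the category box, so treat it as an illustration rather than the real page.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the category box on the page.
SAMPLE_HTML = """
<div id="mw-normal-catlinks" class="mw-normal-catlinks">
  <ul>
    <li><a href="/wiki/Category:Song_contests">Song contests</a></li>
  </ul>
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

result_set = soup.find_all(class_="mw-normal-catlinks")  # list-like ResultSet
single_tag = soup.find(class_="mw-normal-catlinks")      # a Tag, or None if no match

print(type(result_set).__name__)       # ResultSet
print(type(single_tag).__name__)       # Tag
print(hasattr(result_set, "get_text"))  # False: the list itself has no get_text()
print(result_set[0] is single_tag)      # True: both refer to the same div
```

Indexing with `[0]` and calling `find()` land on the same element here; `find()` has the advantage of returning `None` instead of raising `IndexError` when nothing matches.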
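Your categories are under the 'ul' tag, which is a child of the 'div' tag (the one you extracted with find_all()), so you can access them directly and store them in a list like this: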
cate=soup.find_all(class_='mw-normal-catlinks')[0]
x=cate.ul.get_text("|")
categoryList = x.split("|")
print(categoryList)
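Output: ['Eurovision Song Contest', '1956 establishments in Europe', 'Eurovision events', 'Music television', 'Pop music festivals', 'Recurring events established in 1956', 'Song contests']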