Extracting text from a span class with BeautifulSoup returns nothing

Time: 2019-01-15 23:01:21

Tags: python html python-3.x web-scraping beautifulsoup

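I'm trying to grab the headlines from the front page of The New York Times (www.nytimes.com). The script runs to completion without errors, but the soup.find_all call prints no text (or anything else).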

I've been tinkering with the syntax, changing it from soup.find_all(class_="balancedHeadline") to soup.find_all("span", {"class": "balancedHeadline"}), and even adding attrs= before the class dictionary.
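For reference, the three call styles mentioned above are interchangeable in BeautifulSoup, so switching between them would not be expected to change the result. A minimal sketch, assuming a made-up HTML snippet (not the real NYT markup) and the standard-library html.parser instead of lxml:

```python
from bs4 import BeautifulSoup

# Hypothetical markup, used only to compare the three call styles.
html = (
    '<div><span class="balancedHeadline">Example headline</span>'
    '<span class="other">Something else</span></div>'
)
soup = BeautifulSoup(html, "html.parser")  # stdlib parser; 'lxml' also works if installed

by_keyword = soup.find_all(class_="balancedHeadline")
by_dict = soup.find_all("span", {"class": "balancedHeadline"})
by_attrs = soup.find_all("span", attrs={"class": "balancedHeadline"})

# Each call selects the same single span.
for result in (by_keyword, by_dict, by_attrs):
    print([tag.get_text() for tag in result])
```

If all three return an empty list on the live page, the likelier explanation is that the class is missing from the downloaded HTML, not that the syntax is wrong.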

Here is my code. I've spent a while trying to work out what I'm doing wrong, and I still don't know what is causing this:

import requests
from bs4 import BeautifulSoup

base_url = 'https://www.nytimes.com/'
r = requests.get(base_url)
soup = BeautifulSoup(r.text, 'lxml')

headline_text = soup.find_all('span', {'class':'balancedHeadline'})

for headline in headline_text:
    print(headline)

2 answers:

Answer 0 (score: 2)

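First, the reason you get no output with the "balancedHeadline" class is that the page is partially rendered with JavaScript. You can probably see the class in your browser's Inspect tool, but if you view the page source, it isn't there. requests only downloads that raw source and never runs the scripts, so BeautifulSoup has nothing to match.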
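One way to check this without a browser is to count how many tags in the downloaded HTML carry the class. The sketch below is illustrative only: count_class_in_html is a hypothetical helper, and the two HTML strings are made-up stand-ins for what the server sends and what the browser shows after scripts run.

```python
from bs4 import BeautifulSoup


def count_class_in_html(html, class_name):
    """Return how many tags in the given HTML carry class_name."""
    soup = BeautifulSoup(html, "html.parser")
    return len(soup.find_all(class_=class_name))


# Hypothetical stand-ins, not the real NYT markup.
raw_html = "<h2><span>Example headline</span></h2>"
rendered_html = '<h2><span class="balancedHeadline">Example headline</span></h2>'

print(count_class_in_html(raw_html, "balancedHeadline"))       # prints 0
print(count_class_in_html(rendered_html, "balancedHeadline"))  # prints 1
```

Against the live site, you would pass r.text from requests.get(base_url) as the first argument; a count of 0 suggests the class is added client-side.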

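Second, although you can get the headlines from the h2 tags, there are other h2 tags on the page as well. So we need to use the class name of the parent div to isolate the headlines, and then get the output.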

import requests
from bs4 import BeautifulSoup
base_url = 'https://www.nytimes.com/'
r = requests.get(base_url)
soup = BeautifulSoup(r.text, 'lxml')
headline_text = soup.find('div',class_="css-11bbiel").find_all('h2')
for headline in headline_text:
    print(headline.text)

Output

Brexit Deal Fails in Parliament; May Faces No-Confidence Vote
Brexit, explained: Here’s what it all means.
Here’s what could happen next.
William Barr Vows to Protect Justice Dept. Integrity
Court Blocks Trump Administration From Asking About Citizenship in Census
Here are highlights from the Senate confirmation hearing.
House Votes to Condemn White Supremacy After King Comments
King Loses Committee Seats Over Remark
We put together a timeline of Mr. King’s history of racist actions.
Democrats Jilt Trump on Lunch but Look for Shutdown Exit
‘The Shutdown Makes Me Nervous’: Young People Caught in Impasse
Shutdown turmoil at a New York jail: Prisoners went on a hunger strike after family visits were canceled over staffing shortages.
Ex-Mexican President Took $100 Million Bribe, El Chapo Trial Witness Says
Last week, The Times reported on how a Colombian I.T. expert helped the authorities take down the kingpin.
Carol Channing, Larger-Than-Life Broadway Star, Dies at 97
Even From Afar, Channing Served Up That Broadway Wow
Theater colleagues recalled Ms. Channing as a tireless performer and promoter who had little use for doctors’ orders.
Britain Is a Nation in Desperate Need of a Driver
Why Steve King’s Punishment Took So Long 
Next to a National Park, People Plan for Winter. No One Planned for This.
How to Make New York as Progressive on Criminal Justice as Texas
The Cruelty of Call-Out Culture
Donald Trump and His Team of Morons
Republicans Condemn Steve King’s Racism? How Convenient
Our National Emergency Turns 2
Is 2019 Over Yet?
Donald Trump: The Russia File
Actually, the Numbers Show That We Need More Immigration, Not Less
Writer Moves From ‘Moonlight’ to Broadway, and Beyond
The Gay Penguins of Australia
Benno, Proudly Out of Step With the Age
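Class names such as "css-11bbiel" look auto-generated and may change between deployments, in which case soup.find returns None and the chained .find_all raises AttributeError. A slightly more defensive sketch of the same parent-scoped lookup follows; headlines_in_container and the sample markup are hypothetical, introduced only for illustration.

```python
from bs4 import BeautifulSoup


def headlines_in_container(html, container_class):
    """Return the text of every h2 inside the first div with container_class."""
    soup = BeautifulSoup(html, "html.parser")
    container = soup.find("div", class_=container_class)
    if container is None:
        # The class was renamed or the markup changed; return nothing instead of raising.
        return []
    return [h2.get_text(strip=True) for h2 in container.find_all("h2")]


# Made-up markup: one container with headlines, one unrelated h2 elsewhere.
sample = """
<div class="top-stories"><h2>First headline</h2><h2> Second headline </h2></div>
<div class="footer"><h2>Site index</h2></div>
"""
print(headlines_in_container(sample, "top-stories"))  # prints ['First headline', 'Second headline']
print(headlines_in_container(sample, "missing"))      # prints []
```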

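Also, we were lucky this time: the page only uses JavaScript to change the styling, so the headline text is already in the raw HTML. That won't always be the case. When the content itself is loaded by JavaScript, you can use selenium instead, which drives a real browser.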
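A rough sketch of the selenium route, under several assumptions: Chrome and a matching driver are available on the machine, the page has finished rendering by the time page_source is read (an explicit wait may be needed in practice), and the "balancedHeadline" class does appear in the rendered DOM as the Inspect tool suggests. fetch_rendered_html and extract_by_class are hypothetical helper names.

```python
from bs4 import BeautifulSoup


def fetch_rendered_html(url):
    """Load url in headless Chrome and return the DOM after scripts have run."""
    # Imported lazily so the parsing helper below works without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options

    options = Options()
    options.add_argument("--headless")
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        return driver.page_source
    finally:
        driver.quit()  # always close the browser, even if loading fails


def extract_by_class(html, class_name):
    """Return the text of every element carrying class_name."""
    soup = BeautifulSoup(html, "html.parser")
    return [tag.get_text(strip=True) for tag in soup.find_all(class_=class_name)]


# Usage against the live site (not executed here):
# html = fetch_rendered_html("https://www.nytimes.com/")
# print(extract_by_class(html, "balancedHeadline"))
```

Keeping the fetch and the parsing separate means the BeautifulSoup part can be tried on saved HTML without launching a browser each time.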

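Answer 1 (score: 1)

Well, I really can't see the class name you mentioned. If you look at the page source, all the headlines are inside "h2" tags. Try the code below; it prints the full h2 tags, and you can extract the text from that output.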

import requests
from bs4 import BeautifulSoup

base_url = 'https://www.nytimes.com/'
r = requests.get(base_url)
soup = BeautifulSoup(r.text, 'lxml')

headline_text = soup.find_all('h2')

for headline in headline_text:
    print(headline)
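To go from the printed h2 tags to plain text, get_text() can be called on each tag. A minimal sketch, assuming a made-up HTML string in place of the downloaded page:

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for the downloaded page.
html = (
    "<h2><span>First headline</span></h2>"
    "<h2>  </h2>"
    "<h2>Second <em>headline</em></h2>"
)
soup = BeautifulSoup(html, "html.parser")

# get_text() flattens nested tags; the separator keeps words from adjacent
# tags apart, and strip=True trims surrounding whitespace.
headlines = [h2.get_text(" ", strip=True) for h2 in soup.find_all("h2")]
headlines = [text for text in headlines if text]  # drop empty h2 tags
print(headlines)  # prints ['First headline', 'Second headline']
```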