Extracting text with BeautifulSoup

Time: 2017-11-25 03:34:05

Tags: python beautifulsoup

I'm trying to extract text from an older web page, but I've run into a problem. Looking at the page source (http://www.presidency.ucsb.edu/ws/index.php?pid=119039), the text begins:

> </div></div><span class="displaytext"><b>PARTICIPANTS:</b><br>Former Secretary of State
> Hillary Clinton (D) and<br>Businessman Donald Trump
> (R)<p><b>MODERATOR:</b><br>Chris Wallace (Fox News)<p><b>WALLACE:</b>
> Good evening from the Thomas and Mack Center at the University of
> Nevada, Las Vegas. I'm Chris Wallace of Fox News, and I welcome you to
> the third and final of the 2016 presidential debates between Secretary
> of State Hillary Clinton and Donald J. Trump.<p>

I tried to extract the text with the following:

import requests
from bs4 import BeautifulSoup

link = "http://www.presidency.ucsb.edu/ws/index.php?pid=119039"
debate_response = requests.get(link)
debate_soup = BeautifulSoup(debate_response.content, 'html.parser')
debate_text = debate_soup.find_all('div',{'span class':"displaytext"})
print(debate_text)

But this just returns an empty list. Any idea how to extract the text?

1 Answer:

Answer 0: (score: 2)

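I had to use lxml as the parser, because with html.parser I got a maximum recursion depth error. The following extracts all the text from the children of the <span> tag into a single string: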

# Requires the lxml package; debate_response is the requests.get() result from the question
debate_soup = BeautifulSoup(debate_response.content, 'lxml')
debate_text = debate_soup.find('span', {'class': 'displaytext'}).get_text()
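
As for why the original attempt came back empty: as far as I can tell, find_all('div', {'span class': "displaytext"}) looks for a <div> that has an attribute literally named "span class", and no such element exists on the page. Below is a minimal offline sketch of the difference. The inline HTML string is a shortened stand-in for the real page (an assumption, so that nothing is fetched), the variable names are only illustrative, and html.parser is used here only because the fragment is tiny; for the real page lxml was needed, as noted above.

```python
from bs4 import BeautifulSoup

# Shortened stand-in for the page source quoted in the question (assumption:
# the real page wraps the transcript the same way, in <span class="displaytext">).
html = (
    '<div><span class="displaytext"><b>PARTICIPANTS:</b><br>'
    'Former Secretary of State Hillary Clinton (D) and<br>'
    'Businessman Donald Trump (R)'
    '<p><b>MODERATOR:</b><br>Chris Wallace (Fox News)</span></div>'
)

soup = BeautifulSoup(html, "html.parser")

# The question's selector: a <div> with an attribute named "span class".
# Nothing matches, so the result is an empty list.
wrong = soup.find_all("div", {"span class": "displaytext"})
print(wrong)  # prints []

# Select the <span> by tag name and filter on its class attribute instead.
span = soup.find("span", {"class": "displaytext"})

# find() returns None when nothing matches, so guard before calling get_text().
# separator="\n" keeps text from adjacent tags from running together.
text = span.get_text(separator="\n", strip=True) if span is not None else ""
print(text)
```

The separator and strip arguments are optional; without them get_text() concatenates the pieces exactly as they appear, which can glue a speaker label to the sentence that follows it.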