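Getting the text of an <em> tag inside a <div> tag

Date: 2017-07-23 20:49:03

Tags: python html beautifulsoup

I am trying to build a web scraper that collects the following data: title, image src, description, and location. All of these work except the location, which sits inside an <em> tag.

This link shows the code I am using: https://pastebin.com/BFZyyhxB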

import bs4 as bs
import urllib.request

sauce = urllib.request.urlopen('http://www.manchestereveningnews.co.uk/news/greater-manchester-news').read()
soup = bs.BeautifulSoup(sauce, 'lxml')

title = soup.title
image = soup.image
strong = soup.strong
description = soup.description
location = soup.location


title = soup.find('h1', class_='publication-font', )
image = soup.find('img')
strong = soup.find('strong')
location = soup.find('a', 'href', 'em') #This is either done incorrectly or needs more added
description = soup.find('div', class_='description')

print(title.text)
print(image)
print(strong.text)
print(description.string)
print(location)

This shows the HTML structure I want to scrape, including the <em> tag: https://pastebin.com/zHy7H220

<div class="teaser"><figure data-mod="image" data-init="true"><div class="spacer" style="padding-top:66.50%;"></div>


<a href="http://www.manchestereveningnews.co.uk/news/greater-manchester-news/mum-who-witnessed-fianc-michael-13374115">
<img srcset="http://i1.manchestereveningnews.co.uk/incoming/article13366643.ece/ALTERNATES/s180/Mike-Grimshaw.jpg 180w, http://i1.manchestereveningnews.co.uk/incoming/article13366643.ece/ALTERNATES/s390/Mike-Grimshaw.jpg 390w, http://i1.manchestereveningnews.co.uk/incoming/article13366643.ece/ALTERNATES/s458/Mike-Grimshaw.jpg 458w" src="http://i1.manchestereveningnews.co.uk/incoming/article13366643.ece/ALTERNATES/s615/Mike-Grimshaw.jpg">
</a>
</figure>
<div class="inner">
<em><a href="http://www.manchestereveningnews.co.uk/all-about/sale">Sale</a></em> <------------------ text within the <em> tag is what i am trying to get.
<strong>
<a href="http://www.manchestereveningnews.co.uk/news/greater-manchester-news/mum-who-witnessed-fianc-michael-13374115">Mum who witnessed fiancé Michael Grimshaw being fatally stabbed 'cannot face returning home'</a></strong><div class="description">
<a href="http://www.manchestereveningnews.co.uk/news/greater-manchester-news/mum-who-witnessed-fianc-michael-13374115">A fundraising campaign has been set up to help Mr Grimshaw's family in the wake of his tragic death</a>
</div>
</div>
</div>

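As you can see, the location lookup returns nothing, which means my code is incorrect. However, I have not been able to work out how to fix it, despite going through countless tutorials.

Any help is much appreciated.

2 answers:

Answer 0 (score: 2)

Right, the <em> tag wraps the anchor tag. If you want the href link inside that anchor, I believe you need: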

location = soup.find('em').find('a')['href']

If it is the text you want, use:

location = soup.find('em').find('a').string # or .text
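
A chained lookup like this raises an AttributeError as soon as a teaser has no <em>, because the first find returns None. The sketch below is one way to guard against that; it runs on a trimmed copy of the markup from the question, and the helper name em_link is made up for illustration.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the teaser markup from the question.
html = """
<div class="teaser">
  <div class="inner">
    <em><a href="http://www.manchestereveningnews.co.uk/all-about/sale">Sale</a></em>
  </div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")


def em_link(node):
    """Return the <a> inside the first <em> under node, or None if either is missing."""
    em = node.find("em")
    return em.find("a") if em is not None else None


link = em_link(soup)
location_text = link.get_text(strip=True) if link is not None else None
location_href = link["href"] if link is not None else None
print(location_text, location_href)
```

The built-in html.parser is used here only so the snippet does not depend on lxml; the question's 'lxml' parser should behave the same way for this markup.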

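soup.find takes a tag name plus an optional dict of attribute filters (such as the class), not a list of nested tag names. The syntax you used is incorrect.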
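
To make the difference concrete, here is a small sketch comparing the call from the question with the supported forms. It assumes the same trimmed markup as the question; the variable names are illustrative only.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the markup from the question.
html = """
<div class="inner">
  <em><a href="http://www.manchestereveningnews.co.uk/all-about/sale">Sale</a></em>
  <div class="description">
    <a>A fundraising campaign has been set up to help Mr Grimshaw's family in the wake of his tragic death</a>
  </div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# The question's call: a string in the second position is treated as a class
# filter and the third positional argument is `recursive`, so this searches
# for <a class="href"> and finds nothing.
wrong = soup.find("a", "href", "em")

# Attribute filters go in a dict, or in keyword arguments such as class_.
by_dict = soup.find("div", {"class": "description"})
by_keyword = soup.find("div", class_="description")

# Nesting is expressed by chaining calls, not by listing tag names.
location = soup.find("em").find("a")

print(wrong)
print(by_dict == by_keyword)
print(location.string)
```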

Answer 1 (score: 2)

You can do this with a CSS selector.

soup.select_one("div em > a").get_text(strip=True)
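
Since the listing page contains many teasers, the same selector idea can be applied per teaser block to collect all four fields at once. This is only a sketch under the assumption that every story sits in a div with class "teaser", as in the sample markup; the helpers text_of and parse_teasers are hypothetical names, not part of BeautifulSoup.

```python
from bs4 import BeautifulSoup

# Trimmed copy of one teaser from the question (inner hrefs and srcset omitted).
html = """
<div class="teaser">
  <figure>
    <a href="http://www.manchestereveningnews.co.uk/news/greater-manchester-news/mum-who-witnessed-fianc-michael-13374115">
      <img src="http://i1.manchestereveningnews.co.uk/incoming/article13366643.ece/ALTERNATES/s615/Mike-Grimshaw.jpg">
    </a>
  </figure>
  <div class="inner">
    <em><a href="http://www.manchestereveningnews.co.uk/all-about/sale">Sale</a></em>
    <strong>
      <a>Mum who witnessed fiancé Michael Grimshaw being fatally stabbed 'cannot face returning home'</a>
    </strong>
    <div class="description">
      <a>A fundraising campaign has been set up to help Mr Grimshaw's family in the wake of his tragic death</a>
    </div>
  </div>
</div>
"""


def text_of(node, selector):
    """Return the stripped text of the first match under node, or None."""
    match = node.select_one(selector)
    return match.get_text(strip=True) if match is not None else None


def parse_teasers(soup):
    """Build one dict per <div class="teaser"> block."""
    rows = []
    for teaser in soup.select("div.teaser"):
        img = teaser.select_one("img")
        rows.append({
            "location": text_of(teaser, "em > a"),
            "title": text_of(teaser, "strong > a"),
            "description": text_of(teaser, "div.description"),
            "image": img.get("src") if img is not None else None,
        })
    return rows


teasers = parse_teasers(BeautifulSoup(html, "html.parser"))
for row in teasers:
    print(row)
```

Scoping each lookup to its own teaser keeps the location, title, description, and image of one story together, which a page-wide soup.find cannot guarantee.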