Using BeautifulSoup to consider only a certain part of a web page's content

Asked: 2014-05-19 04:19:31

Tags: python web-scraping html-parsing beautifulsoup webpage

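How can I make BeautifulSoup consider only certain parts of a web page?

For example, on the page http://www.dailypress.com/ I want to pick up all the div elements that come after "Most viewed right now".

Here is what I have so far: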

from bs4 import BeautifulSoup
import urllib2  # Python 2 module; on Python 3 use urllib.request instead

url = 'http://www.dailypress.com/'
page = urllib2.urlopen(url)
soup = BeautifulSoup(page.read())

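I can use:

str(soup).find('Most viewed right now')

to find the sentence, but that only gives me its position in the text; it does not let me pick out the part of the content I want.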
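To make the limitation concrete, here is a minimal sketch run against a small made-up HTML snippet rather than the live page. The markup and class names are assumptions for illustration only. It shows that `str.find` returns a plain character offset, not a tag that can be searched further.

```python
from bs4 import BeautifulSoup

# Made-up markup standing in for the real page (an assumption for illustration).
html = """
<div class="header">Top stories</div>
<div class="mostViewed">
  <h3>Most viewed right now</h3>
  <a href="/a">First story</a>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# str.find returns a character offset into the serialized markup (or -1 if
# the text is absent). It is an int, so it has no find_all() to call.
offset = str(soup).find("Most viewed right now")
print(type(offset).__name__, offset >= 0)
```

Because the result is only an index into a string, the parsed tree structure is lost at that point, which is why this approach cannot select the surrounding div.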

1 Answer:

Answer 0 (score: 1)

Find the div that contains the most viewed articles, then look up all the links inside it:

>>> from bs4 import BeautifulSoup
>>> import urllib2
>>> import re
>>> url = "http://www.dailypress.com"
>>> soup = BeautifulSoup(urllib2.urlopen(url))
>>> most_viewed = soup.find('div', class_=re.compile('mostViewed'))
>>> for item in most_viewed.find_all('a'):
...     print item.text.strip()
... 
Body of driver recovered from Chesapeake Bay Bridge-Tunnel wreck
Hampton police looking for man linked to Friday's fatal apartment shooting
Police identify suspect in Saturday's fatal shooting in Hampton
Teen spice user: 'It's the new crack'
When spice came to Gloucester

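The trick here is that we first find the container of the Most Viewed links: it is a div whose class matches mostViewed. You can verify this with your browser's developer tools.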
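The same idea can be tried offline. The sketch below uses Python 3 syntax and a hypothetical HTML snippet in place of the live site, so the class names `topStories` and `module mostViewed` and the link texts are invented for illustration. It also guards against the container being missing, which may well happen if the real page layout has changed since the answer was written.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical markup; the real page's structure may differ.
html = """
<div class="topStories"><a href="/x">Unrelated link</a></div>
<div class="module mostViewed">
  <h3>Most viewed right now</h3>
  <ul>
    <li><a href="/a"> First story </a></li>
    <li><a href="/b">Second story</a></li>
  </ul>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# class_ accepts a compiled regex, so the div is found even though it
# carries more than one CSS class.
most_viewed = soup.find("div", class_=re.compile("mostViewed"))

titles = []
if most_viewed is not None:  # find() returns None when nothing matches
    # Search only inside the container, so links elsewhere are ignored.
    titles = [a.get_text(strip=True) for a in most_viewed.find_all("a")]
print(titles)
```

Restricting `find_all` to the container is what keeps the unrelated link in the first div out of the result.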