有没有办法擦除或分离网页抓取数据?在Python中

时间:2015-12-07 05:01:31

标签: python html5 web-scraping beautifulsoup python-requests

您好我正在从ABC新闻网站上抓取最新消息,我抓的代码看起来像这样:

 <a href="/Politics/huckabee-draws-cheers-fundraiser-west-bank-settlement/story?id=35615831" name="lpos=widget[A_3_freeformlite_4380645_homepage]&amp;lid=link[Headline_2]">Huckabee Draws Cheers at Fundraiser for West Bank Settlement<span class="metaH_timeDay">41 minutes ago</span></a>

但是你注意到我在一个标签里面有一个span标签,所以当我用BeautifulSoup刮掉它时,我得到这样的信息:

  

Huckabee在西岸定居点筹款活动中欢呼41分钟前

但是它给了我完全在我的数据旁边的时间,我想分开41分钟,所以它看起来像这样:

  

Huckabee在西岸定居点筹款活动41分钟前欢呼雀跃

或至少删除它!。

我的代码如下:

import requests
from bs4 import BeautifulSoup

url = "http://abcnews.go.com/"

r = requests.get(url)

soup = BeautifulSoup(r.content, "lxml")

for x in range(1,10):
   for link in soup.find_all("a",{"name": "lpos=widget[A_3_freeformlite_4380645_homepage]&lid=link[Headline_"+str(x)+"]"}):
    print link.text
    print link.find_all("",{"class": "metaH_timeDay"})[0].text
    print ""

有人可以帮助我吗?

2 个答案:

答案 0 :(得分:1)

让我们通过extract()提取它:

>>> link.span.extract()     # remove the first `span` tag that we don't need
>>> time = link.span.extract()
>>> time
<span class="metaH_timeDay">2 hours, 45 minutes ago</span>
>>> link.text
' Obama Seeks to Remove Fear From ISIS Fight'
>>> time.text
'2 hours, 45 minutes ago'
>>> 

答案 1 :(得分:1)

您可以使用decompose()功能 - 运行一段时间来删除span中的所有div标记 -

import requests
from bs4 import BeautifulSoup

url = "http://abcnews.go.com/"

r = requests.get(url)

soup = BeautifulSoup(r.content, "html.parser")

for x in range(1):
    d=soup.select("div.h a")
    for j in d:
        j = str(j)
        f = BeautifulSoup(j,'html.parser')
        while f.span:
            f.span.decompose()
        print f.text.encode('utf-8') 

输出 -

 Obama Seeks to Remove Fear From ISIS Fight
Kerry off to Paris Again for Climate Conference
Huckabee Draws Cheers at Fundraiser for West Bank Settlement
Sanders Unveils Plan to Address Climate Change
 FBI Looking Into Blatter's Role in Bribery Case
Armed Bank Robbery Suspect Shot in Miami Had Escaped From Half-Way House
13 Injured in Attack on Government Office in Western China
Police Arrest Mother of Newborn Baby Who Was Buried Alive
Shooting Suspect's Neighbor Says He Became 'More Withdrawn'
 Justice Department to Investigate Chicago Police
Hillary Clinton Corrects Flub, Thanks to Justice Breyer
 Dashcam Must Be Working
Clinton Laughs Off TrumpΓÇÖs Claims That She Lacks ΓÇÿStaminaΓÇÖ
 Man Killed in Wisconsin Standoff Was a Hostage
 2 New York College Students Abducted, Held Hostage
Transgender Actress, Warhol Muse Holly Woodlawn Dies at 69
 Mood Dour Among Venezuelan Ruling Party Backers
Hillary Clinton Says ΓÇÿWeΓÇÖre Not WinningΓÇÖ Fight Against ISIS 
Jimmy Carter Says Latest Brain Scan Shows No Cancer
One Direction Leads the Way on Twitter's List of 2015 Tweets
Promises of Grocery Stores in Needy Areas Mostly Unfulfilled
McNabb Scores Tiebreaking Goal, Kings Beat Lightning 3-1
Grocery Chains Leave Food Deserts Barren, AP Analysis Finds
Medical Examiner Shortage: Facts About Death Investigations
Roethlisberger Throws 4 TD Passes, Steelers Roll Colts 45-10
Grocery Chains Leave Food Deserts Barren, AP Analysis Finds