Parsing Google News with BeautifulSoup in Python

Date: 2016-05-02 23:15:49

Tags: python web-scraping beautifulsoup

I have the Python code below. It scrapes a Google News search page and prints the hyperlink and title of each story. My problem is that Google News groups similar stories into a bucket, and the script below only prints the first story from each bucket. How can I print all the stories from every bucket?

from bs4 import BeautifulSoup
import requests

headers = {'user-agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36'}

#r = requests.get('http://www.aflcio.org/Legislation-and-Politics/Legislative-Alerts', headers=headers)
r = requests.get('https://www.google.com/search?q=%22eric+bledsoe%22&tbm=nws&tbs=qdr:d', headers=headers)
#the next call rebinds r, so only the lebron james search is actually parsed
r = requests.get('https://www.google.com/search?q=%22lebron+james%22&tbm=nws&tbs=qdr:y', headers=headers)

soup = BeautifulSoup(r.text, "html.parser")

letters = soup.find_all("div", class_="_cnc")
#print soup.prettify() 
#print letters
print type(letters)
print len(letters)
print("\n")

for x in range(0, len(letters)):
    print x
    print letters[x].a["href"]


print("\n")

letters2 = soup.find_all("a", class_="l _HId")
for x in range(0, len(letters2)):
    print x
    print letters2[x].get_text()

print ("\n----------content")
#print letters[0]

By grouped news I mean that, in the image below, the first few stories are grouped together, while the story "LeBron James compares one of his teammates to Denn" is part of another group.

[Screenshot: Google News search results with the first few stories grouped together]

2 Answers:

Answer 0 (score: 1):

I'm not sure what you mean by buckets, but if you mean that you are trying to scrape more than one search, I can tell you that you are overwriting r by calling requests.get() a second time.
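To see the overwrite in miniature (a toy sketch; the URLs are placeholders, not from your question):

import requests

url_one = "https://example.com/a"  #placeholder URL
url_two = "https://example.com/b"  #placeholder URL

r = requests.get(url_one)  #this response is thrown away...
r = requests.get(url_two)  #...because r is rebound here, so only url_two is parsed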

Here is a loop that handles every URL in the urls list.

import bs4
import requests

headers = {'user-agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36'}


urls = ["https://www.google.com/search?q=%22eric+bledsoe%22&tbm=nws&tbs=qdr:d",
        "https://www.google.com/search?q=%22lebron+james%22&tbm=nws&tbs=qdr:y"]

ahrefs = []
titles = []

for url in urls:
    req = requests.get(url, headers=headers)
    soup = bs4.BeautifulSoup(req.text, "html.parser")

    #you don't even have to process the div container;
    #just go straight to the <a> tags and index into "href"
    #headlines
    ahref  = [a["href"] for a in soup.find_all("a", class_="_HId")]
    #"buckets"
    ahref += [a["href"] for a in soup.find_all("a", class_="_sQb")]
    ahrefs.append(ahref)

    #get_text() returns the text inside the hyperlink,
    #i.e. the title you want
    title =  [a.get_text() for a in soup.find_all("a", class_="_HId")]
    title += [a.get_text() for a in soup.find_all("a", class_="_sQb")]
    titles.append(title)

#print(ahrefs)
#print(titles)

My Google search for lebron shows 18 results, including the sub-headlines, and len(ahrefs[1]) == 18.
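As a quick sanity check (a minimal sketch that continues from the loop above, assuming ahrefs and titles stay in step), you can print each title next to its link:

for ahref, title in zip(ahrefs, titles):
    for href, text in zip(ahref, title):
        print(text, "->", href)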

Answer 1 (score: 1):

Taking another pass at this, I decided a more scalable way to attack the problem is a dictionary of queries, so that to search for a new player you only have to add an entry. I'm not sure exactly what final output you want, but this returns a list of dictionaries.

import bs4
import requests

headers = {'user-agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36'}


#just add to this list for each new player
#player name : url
queries = {"bledsoe":"https://www.google.com/search?q=%22eric+bledsoe%22&tbm=nws&tbs=qdr:d",
           "james":"https://www.google.com/search?q=%22lebron+james%22&tbm=nws&tbs=qdr:y"}


total = []

for player in queries: #keys

    #request the google query url of each player
    req  = requests.get(queries[player], headers=headers)
    soup = bs4.BeautifulSoup(req.text, "html.parser")

    #look for the main container
    for each in soup.find_all("div"):
        results = {player: {
            "link": None,
            "title": None,
            "source": None,
            "time": None}
        }

        try:
            #if a <div> has no class attribute at all,
            #attrs["class"] raises a KeyError; just ignore those

            if "_cnc" in each.attrs["class"]: #main stories
                results[player]["link"] = each.find("a")["href"]
                results[player]["title"] = each.find("a").get_text()
                sourceAndTime = each.contents[1].get_text().split("-")
                results[player]["source"], results[player]["time"] = sourceAndTime
                total.append(results)

            elif "card-section" in each.attrs["class"]: #buckets
                results[player]["link"] = each.find("a")["href"]
                results[player]["title"] = each.find("a").get_text()
                results[player]["source"] = each.contents[1].contents[0].get_text()
                results[player]["time"] = each.contents[1].get_text()
                total.append(results)

        except KeyError:
            pass
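
If it helps, here is one way you could consume total afterwards (a sketch, assuming the dictionary layout built above):

for entry in total:
    for player, fields in entry.items():
        print(player, "|", fields["title"], "|", fields["link"])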