Python: downloading Google search links across multiple pages / news search

Asked: 2018-04-26 19:09:44

Tags: python beautifulsoup

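I am trying to download links from a Google search (in Python), and I am using Beautiful Soup to do it. http://www.google.ca/search?q=QUERY_HERE is the URL I send my request to. I would like to get more links from page 2 / page 3 as well.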

How can I do this, and how can I restrict the search to Google News only?

2 answers:

Answer 0 (score: 0)

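First, find your results-per-page setting using the search settings option at the bottom right of the google.com page, or check whether the link below still works: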

https://www.google.co.in/preferences?hl=en

Then, in the query URL, you can specify the start parameter:

https://www.google.co.in/search?q=hello&hl=en&start=70

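So with start=0 you are on the first page; from there, just change the start value according to your results-per-page setting. For example, with 10 results per page, start=10 is the second page and start=20 is the third.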
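As a rough sketch of the idea above, the start offset can be computed from a page index and the results-per-page value. The function name `build_search_url`, the default of 10 results per page, and the google.com base URL are assumptions made for illustration, not part of the original answer.

```python
from urllib.parse import urlencode


def build_search_url(query, page, results_per_page=10,
                     base="https://www.google.com/search"):
    """Build a search URL for a zero-based page index.

    `results_per_page` is assumed to match the setting in your
    Google search preferences; 10 is only a placeholder default.
    """
    params = {
        "q": query,
        "hl": "en",
        # start is the offset of the first result on the page
        "start": page * results_per_page,
    }
    return f"{base}?{urlencode(params)}"


# URLs for the first three pages of results
urls = [build_search_url("hello", page) for page in range(3)]
for url in urls:
    print(url)
```

Each URL can then be fetched and parsed the same way as the first page.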

Answer 1 (score: 0)

To search using Google News only, you need to add tbm=nws to your URL. https://www.google.com/search?q=coca+cola --> https://www.google.com/search?q=coca+cola&tbm=nws
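If the URL is built in code, the parameter can be added without string concatenation. The following is a minimal sketch using only the standard library; the helper name `to_news_search` is hypothetical.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def to_news_search(url):
    """Return `url` with tbm=nws set, keeping the other query parameters.

    Note: repeated query keys are collapsed to their last value,
    which is fine for simple search URLs like the one above.
    """
    parts = urlsplit(url)
    query = dict(parse_qsl(parts.query, keep_blank_values=True))
    query["tbm"] = "nws"  # switch the search to the News tab
    return urlunsplit(parts._replace(query=urlencode(query)))


print(to_news_search("https://www.google.com/search?q=coca+cola"))
```

Calling it on a URL that already has a tbm value simply overwrites that value.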

Here is how to scrape the actual pagination using the beautifulsoup, requests, and urllib libraries.

Code and example in the online IDE:

from bs4 import BeautifulSoup
import requests, urllib.parse

def paginate(url, previous_url=None):
    # Break from infinite recursion
    if url == previous_url: return

    headers = {
        "User-Agent":
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3538.102 Safari/537.36 Edge/18.19582"
    }

    response = requests.get(url, headers=headers).text
    soup = BeautifulSoup(response, 'lxml')

    # First page
    yield soup

    next_page_node = soup.select_one('a#pnnext')

    # Stop when there is no next page
    if next_page_node is None: return

    next_page_url = urllib.parse.urljoin('https://www.google.com/',
                                         next_page_node['href'])

    # Pages after the first one
    yield from paginate(next_page_url, url)


def scrape():
    pages = paginate(
        "https://www.google.com/search?hl=en-US&q=coca+cola&tbm=nws")

    for soup in pages:
        print(f'Current page: {int(soup.select_one(".YyVfkd").text)}')
        print()

        for data in soup.find_all('div', class_='dbsr'):
            title = data.find('div', class_='JheGif nDgy9d').text
            link = data.a['href']

            print(f'Title: {title}')
            print(f'Link: {link}')
            print()

if __name__ == "__main__":
    scrape()

# part of the output:
'''
Results via beautifulsoup

Current page: 1
Title: A Post-Truth World: Why Ronaldo Did Not Move Coca-Cola Share Price
Link: https://www.forbes.com/sites/iese/2021/06/19/a-post-truth-world-why-ronaldo-did-not-move-coca-cola-share-price/

...

Current page: 22

Title: The Coca-Cola Co. unveils oat milk line
Link: https://www.foodbusinessnews.net/articles/18356-the-coca-cola-co-unveils-oat-milk-line
'''
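The next-page step in `paginate` (find the `a#pnnext` link, then resolve its relative `href` against the site root) can be tried offline. The sketch below swaps BeautifulSoup for the standard library's `html.parser` so it runs without third-party packages or network access; the `NextLinkParser` class, the `next_page_url` helper, and the sample HTML are made up for illustration, and the `pnnext` id is taken from the code above and may not match Google's current markup.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class NextLinkParser(HTMLParser):
    """Record the href of the first <a id="pnnext"> element."""

    def __init__(self):
        super().__init__()
        self.next_href = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and attrs.get("id") == "pnnext" and self.next_href is None:
            self.next_href = attrs.get("href")


def next_page_url(html, base="https://www.google.com/"):
    """Return the absolute URL of the next page, or None if there is none."""
    parser = NextLinkParser()
    parser.feed(html)
    if parser.next_href is None:
        return None
    # The href is relative, so resolve it against the site root
    return urljoin(base, parser.next_href)


# Hypothetical fragment standing in for a real results page
sample = '<a id="pnnext" href="/search?q=coca+cola&amp;tbm=nws&amp;start=10">Next</a>'
print(next_page_url(sample))
print(next_page_url("<p>last page</p>"))
```

A `None` result corresponds to the "no next page" stop condition in the recursive generator.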

Alternatively, you can use the Google Search Engine Results API from SerpApi to do the same thing. It is a paid API with a free trial of 5,000 searches. Check out the playground.

Code to integrate:

# https://github.com/serpapi/google-search-results-python
from serpapi import GoogleSearch
import os

def scrape():
  params = {
    "engine": "google",
    "q": "coca cola",
    "tbm": "nws",
    "api_key": os.getenv("API_KEY"),
  }

  search = GoogleSearch(params)
  pages = search.pagination()

  for result in pages:
    print(f"Current page: {result['serpapi_pagination']['current']}")

    for news_result in result["news_results"]:
      print(f"Title: {news_result['title']}\nLink: {news_result['link']}\n")

if __name__ == "__main__":
  scrape()

# part of the output:
'''
Results from SerpApi

Current page: 1
Title: A Post-Truth World: Why Ronaldo Did Not Move Coca-Cola Share Price
Link: https://www.forbes.com/sites/iese/2021/06/19/a-post-truth-world-why-ronaldo-did-not-move-coca-cola-share-price/

...

Current page: 5
Title: Coca-Cola, Monster win appeal of $9.6 million verdict over ...
Link: https://www.reuters.com/legal/transactional/coca-cola-monster-win-appeal-96-million-verdict-over-hansens-rights-2021-06-18/
'''

Disclaimer: I work for SerpApi.