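Extracting ratings from Google search results

Time: 2019-01-08 19:21:08

Tags: python python-3.x web-scraping

I am trying to extract Google search results using the Google API in Python. I can extract the URL, link, title, and snippet, but I also want to extract the rating shown in the Google search results. Below is the code I am using: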



$.ajax({
      url: 'http://example/test/profileForm.php',
      data: form,
      processData: false,
      contentType: false,
      type: 'POST',
      success: function (data) {
        $("#loadingIMG").hide();
        $(imgEdit).attr('src', data);
      }
  });

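When I search Google for "swiggy company review", the first search result shows a rating of 3.7, but I don't know how to extract that information. Can anyone suggest a solution? Thanks in advance.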

1 answer:

Answer 0: (score: 0)

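Since the Google API is deprecated, you can scrape the ratings with BeautifulSoup instead, using CSS selectors through its select() (for multiple elements) and select_one() (for a single element) methods.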
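As a minimal illustration of the difference between select() and select_one(), here is a sketch on a made-up HTML fragment. The markup and the class names `result` and `score` are assumptions for the example only; Google's real markup is different and changes often.

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration only; this is NOT Google's real HTML.
html = """
<div class="result"><span class="score">Rating: 4</span><span>1,219 votes</span></div>
<div class="result"><span class="score">Rating: 3.8</span><span>46 votes</span></div>
"""

# html.parser ships with Python, so no extra parser package is needed here.
soup = BeautifulSoup(html, "html.parser")

# select_one() returns the first matching element, or None if nothing matches.
first_score = soup.select_one(".result .score")

# select() returns a list of every matching element (empty if nothing matches).
all_scores = soup.select(".result .score")

first_rating = first_score.text.split(":")[1].strip()
all_ratings = [el.text.split(":")[1].strip() for el in all_scores]
missing = soup.select_one(".no-such-class")

print(first_rating, all_ratings, missing)
```

Because select_one() can return None, it is worth checking the result before reading `.text` from it.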

Code and full example:

import json

import requests
from bs4 import BeautifulSoup  # the 'lxml' parser used below needs the lxml package installed

headers = {
  "User-Agent":
  "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}

# Pass the query through params so requests URL-encodes the spaces.
response = requests.get(
  'https://www.google.com/search',
  params={'q': 'swiggy company review'},
  headers=headers).text

soup = BeautifulSoup(response, 'lxml')

# Selects just one Review element (using converted xPath to CSS selector):
# review = soup.select_one('#rso > div:nth-of-type(1) > div > div > div:nth-of-type(2) > div > span:nth-of-type(1)').text
# print(review)

# Selects just one Vote element (using converted xPath to CSS selector):
# votes = soup.select_one('#rso > div:nth-of-type(1) > div > div > div:nth-of-type(2) > div > span:nth-of-type(2)').text
# print(votes)

data = []

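# Selects multiple Rating/Vote elements:
for result in soup.select('.uo4vr'):
    rating_el = result.select_one('.uo4vr g-review-stars+ span')
    votes_el = result.select_one('.uo4vr span+ span')

    # Skip results without a rating snippet; select_one() returns None for them.
    if rating_el is None or votes_el is None:
        continue

    rating = rating_el.text.split(':')[1].strip()
    votes_reviews = votes_el.text.split(' ')[0]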

    data.append({
      "Rating": rating,
      "Votes/Reviews": votes_reviews,
    })

print(json.dumps(data, indent=2))

Output:

[
  {
    "Rating": "4",
    "Votes/Reviews": "1,219"
  },
  {
    "Rating": "4",
    "Votes/Reviews": "1,090"
  },
  {
    "Rating": "3.8",
    "Votes/Reviews": "46"
  },
  {
    "Rating": "3.8",
    "Votes/Reviews": "260"
  },
  {
    "Rating": "4.1",
    "Votes/Reviews": "1,047"
  },
  {
    "Rating": "3.3",
    "Votes/Reviews": "47"
  },
  {
    "Rating": "1.5",
    "Votes/Reviews": "114"
  }
]
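The scraped values above are strings. If you want to sort or compare them, a small conversion step helps. The following is only a sketch: the helper name `to_number` is hypothetical, and it assumes the vote counts use a comma as the thousands separator, as in the output above.

```python
def to_number(text):
    """Convert a scraped string such as '1,219' or '3.8' to an int or a float."""
    cleaned = text.replace(",", "").strip()
    return float(cleaned) if "." in cleaned else int(cleaned)


# Two entries copied from the output above.
data = [
    {"Rating": "4", "Votes/Reviews": "1,219"},
    {"Rating": "3.8", "Votes/Reviews": "46"},
]

# Ratings are always treated as floats; vote counts become ints.
numeric = [
    {
        "Rating": float(row["Rating"]),
        "Votes/Reviews": to_number(row["Votes/Reviews"]),
    }
    for row in data
]

print(numeric)
```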

Alternatively, you can use the Google Organic Results API from SerpApi. It is a paid API with a free trial.

Code to integrate:

from serpapi import GoogleSearch
import os, json

params = {
  "engine": "google",
  "q": "swiggy company review",
  "api_key": os.getenv("API_KEY"),
}

search = GoogleSearch(params)
results = search.get_dict()

# For extracting single elements:
# rating = results['organic_results'][0]['rich_snippet']['top']['detected_extensions']['rating']
# print(f"Rating: {rating}")

# votes = results['organic_results'][0]['rich_snippet']['top']['detected_extensions']['votes']
# print(f"Votes: {votes}")


# For extracting multiple elements:
data = []

for organic_result in results['organic_results']:

  title = organic_result['title']

  # Not every result has a rich snippet, so fall back to empty dicts
  # and let .get() return None for missing fields.
  extensions = (organic_result.get('rich_snippet', {})
                              .get('top', {})
                              .get('detected_extensions', {}))

  rating = extensions.get('rating')
  votes = extensions.get('votes')
  reviews = extensions.get('reviews')

  data.append({
    "Title": title,
    "Rating": rating,
    "Votes": votes,
    "Reviews": reviews,
  })

print(json.dumps(data, indent=2))

Output:

[
  {
    "Title": "Swiggy Reviews | Glassdoor",
    "Rating": 4,
    "Votes": 1219,
    "Reviews": null
  },
  {
    "Title": "Ride.Swiggy: 254 Employee Reviews | Indeed.com",
    "Rating": null,
    "Votes": null,
    "Reviews": null
  },
  {
    "Title": "Working at Swiggy | Glassdoor",
    "Rating": 4,
    "Votes": 1090,
    "Reviews": null
  }
]
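Once the results are in a list like the one above, you may want to keep only the entries that actually carry a rating. A minimal sketch, using the three entries from the output as sample data (the sort order chosen here is just one possible choice):

```python
# Three entries copied from the output above (JSON null becomes None in Python).
data = [
    {"Title": "Swiggy Reviews | Glassdoor", "Rating": 4, "Votes": 1219, "Reviews": None},
    {"Title": "Ride.Swiggy: 254 Employee Reviews | Indeed.com", "Rating": None, "Votes": None, "Reviews": None},
    {"Title": "Working at Swiggy | Glassdoor", "Rating": 4, "Votes": 1090, "Reviews": None},
]

# Drop results that have no rating at all.
rated = [row for row in data if row["Rating"] is not None]

# Highest rating first; ties are broken by the number of votes (None counts as 0).
rated.sort(key=lambda row: (row["Rating"], row["Votes"] or 0), reverse=True)

for row in rated:
    print(f'{row["Title"]}: {row["Rating"]} ({row["Votes"]} votes)')
```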

Disclaimer: I work for SerpApi.