Question

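I need help extracting URLs from Google search results and was told to use Nokogiri. I installed it and read the Nokogiri documentation, but I don't know where to start; it's all Greek to me.

I know that what I'm looking for is the URL of each result, and that each one sits between <cite> tags. So far, all I've figured out is how to pull down the search results; I don't know how to extract specific data from the document. This is the very small amount of code I have: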

serp = Nokogiri::HTML(open("http://www.google.com/search?num=100&q=stackoverflow"))

Answer 1

Enjoy :)

require 'open-uri'
require 'nokogiri'

# On Ruby 3.0+ Kernel#open no longer fetches URLs; open-uri provides URI.open instead
page = URI.open("http://www.google.com/search?num=100&q=stackoverflow")
html = Nokogiri::HTML(page)

html.search("cite").each do |cite|
  puts cite.inner_text
end

See also the Nokogiri tutorials.

Answer 2

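Make sure you send a user-agent header; otherwise the request may come back with empty output, because Google will not treat it as a visit from a real user. See "What is my user-agent" to look up your own.

If the num parameter is set to 100, the code may raise an error, because at some point there will be no more results. To ignore it, you can wrap the loop in an exception block.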

headers = {
  "User-Agent" => "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}
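
As a rough illustration of where the header and the query parameters end up, here is a sketch that builds the request with Ruby's standard Net::HTTP and inspects it without sending anything. The build_search_request helper and the shortened user-agent string are assumptions made for this example only.

```ruby
require 'net/http'
require 'uri'

# Illustrative helper: build (but do not send) a GET request for a search URL.
def build_search_request(query, user_agent, num: 10)
  uri = URI("https://www.google.com/search")
  uri.query = URI.encode_www_form(q: query, num: num)

  request = Net::HTTP::Get.new(uri)
  request["User-Agent"] = user_agent  # replaces Ruby's default user-agent
  request
end

request = build_search_request("stackoverflow", "Mozilla/5.0 (example)", num: 100)
puts request.path
puts request["User-Agent"]
```

Inspecting the request before sending it is a cheap way to confirm that the header is really set.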

Code and example in the online IDE:

require 'nokogiri'
require 'httparty'
require 'json'

headers = {
  "User-Agent" => "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}

params = {
  q: "stackoverflow",
  num: "100"
}

response = HTTParty.get('https://www.google.com/search',
                        :query => params,
                        :headers => headers)
doc = Nokogiri::HTML(response.body)

data = []

begin
  doc.css('.tF2Cxc').each do |result|
    title = result.css('.DKV0Md').first.text
    link = result.css('.yuRUbf a').first["href"]
    displayed_link = result.css('.tjvcx').first.text
    snippet = result.css('.VwiC3b').first.text
    # puts "#{title}#{snippet}#{link}#{displayed_link}\n\n"

    data << {
      :title => title,
      :link => link,
      :displayed_link => displayed_link,
      :snippet => snippet,
    }
  end
rescue StandardError
  # do nothing if an error occurs
end

puts JSON.pretty_generate(data)

--------
=begin
[
  {
    "title": "Stack for Stack Overflow - Apps on Google Play",
    "link": "https://play.google.com/store/apps/details?id=me.tylerbwong.stack&hl=en_US&gl=US",
    "displayed_link": "https://play.google.com › store › apps › details",
    "snippet": "Stack is powered by Stack Overflow and other Stack Exchange sites. Search and filter through questions to find the exact answer you're looking for!"
  }
...
]
=end

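Alternatively, you can use the Google Organic Results API from SerpApi. It's a paid API with a free plan.

The main difference is that you don't have to figure out how to scrape the relevant parts of the page. All that's left to do is iterate over the structured JSON.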

require 'google_search_results' 
require 'json'

params = {
  api_key: ENV["API_KEY"],
  engine: "google",
  q: "stackoverflow",
  hl: "en",
  num: "100"
}

search = GoogleSearch.new(params)
hash_results = search.get_hash

data = []

hash_results[:organic_results].each do |result|
  title = result[:title]
  link = result[:link]
  displayed_link = result[:displayed_link]
  snippet = result[:snippet]

  data << {
    :title => title,
    :link => link,
    :displayed_link => displayed_link,
    :snippet => snippet
  }
end
puts JSON.pretty_generate(data)


-------------
=begin
[
  {
    "title": "Stack Overflow - Home | Facebook",
    "link": "https://www.facebook.com/officialstackoverflow/",
    "displayed_link": "https://www.facebook.com › Pages › Interest",
    "snippet": "Stack Overflow. 519455 likes · 587 talking about this. We are the world's programmer community."
  }
...
]
=end
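
The "iterate over structured JSON" step can also be tried offline. The sketch below parses a small sample string with symbol keys, mirroring how the code above indexes the hash, and falls back to an empty list when the organic_results key is absent. The sample payload and the organic_links helper are assumptions for illustration, not actual API output.

```ruby
require 'json'

# Hypothetical sample shaped like the output shown above.
SAMPLE_RESPONSE = <<~SAMPLE
  {
    "organic_results": [
      {
        "title": "Stack Overflow - Home | Facebook",
        "link": "https://www.facebook.com/officialstackoverflow/"
      }
    ]
  }
SAMPLE

# Illustrative helper: fetch with a default avoids calling .each on nil
# when the response contains no organic results.
def organic_links(response_hash)
  response_hash.fetch(:organic_results, []).map do |result|
    { title: result[:title], link: result[:link] }
  end
end

parsed = JSON.parse(SAMPLE_RESPONSE, symbolize_names: true)
links = organic_links(parsed)
puts JSON.pretty_generate(links)
```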

Disclaimer: I work for SerpApi.

How do I parse Google search results with Nokogiri?
