How do I get the href attribute of an anchor tag?

Asked: 2013-05-28 13:40:53

Tags: ruby nokogiri

I am trying to scrape the site http://expo.getbootstrap.com/

The HTML looks like this:

<div class="col-span-4">
  <p>
    <a class="thumbnail" target="_blank" href="https://www.getsentry.com/">
      <img src="/screenshots/sentry.jpg">
    </a>
  </p>
</div>

My Nokogiri code is:

url = "http://expo.getbootstrap.com/"
doc = Nokogiri::HTML(open(url))
puts doc.css("title").text
doc.css(".col-span-4").each do |site|
  title=site.css("h4 a").text
  href = site.css("a.thumbnail")[0]['href']
end  

The goal is simple: get the href of the link that wraps the <img> tag, along with the site's title. However, it keeps reporting:

undefined method `[]' for nil:NilClass

on this line:

href = site.css("a.thumbnail")[0]['href']

This is driving me crazy, because the same code I wrote here actually works in another case.

2 answers:

Answer 0: (score: 2)

I would do something like this:

require 'nokogiri'
require 'open-uri'
require 'pp'

# URI.open (from open-uri) is needed on Ruby 3.0+, where Kernel#open no longer accepts URLs
doc = Nokogiri::HTML(URI.open('http://expo.getbootstrap.com/'))

thumbnails = doc.search('a.thumbnail').map{ |thumbnail|
  {
    href: thumbnail['href'],
    src: thumbnail.at('img')['src'],
    title: thumbnail.parent.parent.at('h4 a').text
  }
}

pp thumbnails

Which, after running, gives:

# => [
  {
    :href => "https://www.getsentry.com/",
    :src => "/screenshots/sentry.jpg",
    :title => "Sentry"
  },
  {
    :href => "http://laravel.com",
    :src => "/screenshots/laravel.jpg",
    :title => "Laravel"
  },
  {
    :href => "http://gruntjs.com",
    :src => "/screenshots/gruntjs.jpg",
    :title => "Grunt"
  },
  {
    :href => "http://labs.bittorrent.com",
    :src => "/screenshots/bittorrent-labs.jpg",
    :title => "BitTorrent Labs"
  },
  {
    :href => "https://www.easybring.com/en",
    :src => "/screenshots/easybring.jpg",
    :title => "Easybring"
  },
  {
    :href => "http://developers.kippt.com/",
    :src => "/screenshots/kippt-developers.jpg",
    :title => "Kippt Developers"
  },
  {
    :href => "http://www.learndot.com/",
    :src => "/screenshots/learndot.jpg",
    :title => "Learndot"
  },
  {
    :href => "http://getflywheel.com/",
    :src => "/screenshots/flywheel.jpg",
    :title => "Flywheel"
  }
]

Answer 1: (score: 1)

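You are not accounting for the fact that not every .col-span-4 div contains a thumbnail. For those divs, site.css("a.thumbnail") is empty, so [0] returns nil and calling ['href'] on it raises the error. This should work: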

require 'nokogiri'
require 'open-uri'

url = "http://expo.getbootstrap.com/"
doc = Nokogiri::HTML(URI.open(url))
puts doc.css("title").text
doc.css(".col-span-4").each do |site|
  title = site.css("h4 a").text
  thumbnail = site.css("a.thumbnail")
  # Skip divs with no thumbnail link; indexing an empty NodeSet returns nil
  next if thumbnail.empty?
  href = thumbnail[0]['href']
end