Nokogiri HTML Nested Elements: Extract Class and Text

Time: 2016-12-09 02:48:09

Tags: html ruby nokogiri

I have a basic page structure where elements (spans) are nested under other elements (divs and spans). Here is an example:

html = '<html>
  <body>
    <div class="item">
         <div class="profile">
      <span class="itemize">
         <div class="r12321">Plains</div>
          <div class="as124223">Trains</div>
           <div class="qwss12311232">Automobiles</div>
      </div>
      <div class="profile">
        <span class="itemize">
          <div class="lknoijojkljl98799999">Love</div>
           <div class="vssdfsd0809809">First</div>
            <div class="awefsaf98098">Sight</div>
      </div>
    </div>
  </body>
</html>'

Note that the class names are random. Note also that there are spaces and tabs in the HTML.

I want to extract the children and end up with a hash, like this:

page = Nokogiri::HTML(html)
itemhash = Hash.new
page.css('div.item div.profile span').map do |divs|
  children = divs.children
  children.each do |child|
    itemhash[child['class']] = child.text
  end
end

The result should be similar to:

{"r12321"=>"Plains", "as124223"=>"Trains", "qwss12311232"=>"Automobiles", "lknoijojkljl98799999"=>"Love", "vssdfsd0809809"=>"First", "awefsaf98098"=>"Sight"}

But I end up with a mess instead.

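This is because of the tabs and spaces in the HTML. I have no control over how the HTML is generated, so I am trying to work around it. I tried noblanks, but that did not work. I also tried gsub, but that only destroyed my markup.

How can I extract the classes and values of these nested elements while cleanly ignoring the spaces and tabs?

P.S. I am not hung up on Nokogiri, so if another gem can do this better, I am game.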

1 Answer:

Answer 0: (score: 1)

The children method returns all child nodes, including text nodes, even those that contain nothing but whitespace.

To get only the child elements, you can run an explicit XPath query (or possibly the equivalent CSS), for example:

children = divs.xpath('./div')

You can also use the element_children method, which is closer to what you are already doing and returns only the children that are elements:

children = divs.element_children