How to scrape data while ignoring embedded tags

Time: 2017-01-03 00:13:04

Tags: ruby-on-rails ruby web-scraping nokogiri

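I am trying to scrape the "Last sale", "Sale date" and "Currently for sale" values, excluding everything inside <span id="sold-prices" class="none">: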

<div class="seperate">
    <h2>Public info</h2>
    <p>
        <strong>Property type:</strong> Semi-detached house |
        <strong>Tenure:</strong> Leasehold |
        <strong>Last sale:</strong> £71,000 | <strong>Sale date:</strong> 5th Dec 2007 - <a href="" class="toggle_sold_prices">Previous sales</a>
        <span id="sold-prices" class="none">
                        <br>
                            <strong>Property type:</strong>
                            Semi-detached house | 
                            <strong>Tenure:</strong>
                            Leasehold | 
                        <strong>Previous sale:</strong> £75,000 | 
                        <strong>Sale date:</strong> 
     3rd Oct 2006
                        <br>
                            <strong>Property type:</strong>
                            Semi-detached house | 
                            <strong>Tenure:</strong>
                            Leasehold | 
                        <strong>Previous sale:</strong> £36,000 | 
                        <strong>Sale date:</strong> 
    26th Sep 2002
                        <br>
                            <strong>Property type:</strong>
                            Semi-detached house | 
                            <strong>Tenure:</strong>
                            Leasehold | 
                        <strong>Previous sale:</strong> £39,950 | 
                        <strong>Sale date:</strong> 
    27th Jan 1995
                            <span class="new-build">New build</span>
        </span>
        | <a href="/for-sale/details/42175871"><i class="icon icon-home nolink"></i>Currently for sale</a>
    </p>
</div>

I know I can use

<span id="sold-prices" class="none">

to get the HTML inside a separate div, but I don't know how to scrape the data for only the tags I want. Any ideas?

1 answer:

Answer 0 (score: 2)

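Once Nokogiri has parsed the HTML, it is easy to find and manipulate nodes. Sometimes that means selectively removing nodes to simplify the DOM. This is one of those times: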

require 'nokogiri'
doc = Nokogiri::HTML(<<EOT)
<div class="seperate">
  <p>
    <strong>Property type:</strong> Semi-detached house |
    <strong>Tenure:</strong> Leasehold |
    <strong>Last sale:</strong> £71,000 | <strong>Sale date:</strong> 5th Dec 2007 - <a href="" class="toggle_sold_prices">Previous sales</a>
    <span id="sold-prices" class="none">
      <br>
          <strong>Property type:</strong>
          Semi-detached house | 
          <strong>Tenure:</strong>
          Leasehold | 
    </span>
  </p>
</div>
EOT

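# Drop the nested block of previous sales so only the top-level fields remain.
doc.at('#sold-prices').remove

# Pair each <strong> label with the text node that follows it,
# stripping the "|" separators and surrounding whitespace.
data = doc.search('strong').map { |strong|
  [strong.text, strong.next_sibling.text.tr('|', '').strip]
}.to_h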

data # => {"Property type:"=>"Semi-detached house", "Tenure:"=>"Leasehold", "Last sale:"=>"£71,000", "Sale date:"=>"5th Dec 2007 -"}

The trick is:

doc.at('#sold-prices').remove

That clears away the forest so you can see the trees you want.

The resulting data needs a bit more cleanup, but the rest of the code should be self-explanatory, so tweaking it should be easy.