Question

I need to parse a local HTML file with Nokogiri, but the HTML has no <div> elements or classes. It starts with plain text.

This is the HTML:

high prices in <a href="Example 1">Example 1</a><br>
low prices in <a href="Example 2">Example 2</a><br>

In this case, I only need to get "high" and "low", and "Example 1" and "Example 2".

How do I get text that is not inside an element? In the tutorials I have seen, some <div class= ...> is needed to get the text.

doc.xpath('//a/@href').each do |node|   # get performance indicators
  link = node.text
  @test << Entry2.new(link)
end

@title = doc.xpath('//p').text.scan(/^(high|low)/)

My view:

<% @test.each do |entry| %>
  <p><%= entry.link %></p>
<% end %>

<% @title.each do |f| %>
  <p><%= f %></p>
<% end %>

The output is as follows:

Example 1Example 2

[["high"], ["low"]]

It lists everything at once instead of one item at a time. How do I change my Nokogiri code so that the output looks like this?

high prices in Example 1
low prices in Example 2

Answer 1

Well, Nokogiri will wrap that string in an implicit <html><body><p>..., so the text will be inside a single <p>.

So yes, you will be able to get the links in structured form:

doc.xpath "//a"

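The "high" and "low" strings will sit in a single block of text. You will probably need a regex to pull them out, and the details will depend on your requirements and data, but here is a regex for what you are showing and asking for: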

doc.xpath('//p').text.scan(/^(high|low)/)
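The nested `[["high"], ["low"]]` in the question's output comes from the capture group in that regex: `String#scan` returns one array per match when the pattern contains groups. A minimal sketch of the difference, assuming the extracted text looks like the `text` string below (a stand-in for whatever `doc.xpath('//p').text` returns on your file):

```ruby
# Stand-in for the text extracted from the document (an assumption for this sketch).
text = "high prices in Example 1\nlow prices in Example 2\n"

# With a capture group, scan returns an array of arrays.
nested = text.scan(/^(high|low)/)

# With a non-capturing group, scan returns plain strings.
flat = text.scan(/^(?:high|low)/)

p nested  # => [["high"], ["low"]]
p flat    # => ["high", "low"]
```

Calling `flatten` on the nested result would give the same flat list if you prefer to keep the original regex.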

I can't be sure how much this helps with your actual requirements, but hopefully it points you in the right direction.

How to scrape HTML without tags using Nokogiri?
