Ruby gsub doesn't work across newlines

Date: 2016-01-12 18:51:11

Tags: ruby

This is driving me crazy. I have a string that is a lengthy piece of XHTML:

irb(main):012:0> input = <<-END
irb(main):013:0" <p><span class=\"caps\">ICES</span> evaluated the management plan in 2009
 and found it to be in accordance with the PA. However, the <span class=\"caps\">SSB</span> index , being based on lengths, excludes the problem connected with age estimation.</p>\n<p><span class=\"caps\">SSB</span> 
 index is estimated to have decreased by more than 20% between the periods 2010–2012 
 (average of the three years) and 2013–2014 (average of the two years).</p>\n<p>A candidate 
 multispecies F<sub><span class=\"caps\">MSY</span></sub> was estimated.</p><pre><code><p>
 The management plan, agreed October 2007 and implemented January 2008 was evaluated by 
 <span class=\"caps\">ICES</span> as to its accordance with the precautionary approach and 
 reviewed by three independent scientists.</p>\n<p>As the strong 2005 and 2006 year classes 
 enter the fishery discarding is expected to further increase, justifying the implementation 
 of measures to improve gear selectivity, such as increases in mesh size 
 (<span class=\"caps\">ICES</span>, 2009a).</p></code></pre>
irb(main):014:0" END
=> "<p><span class=\"caps\">ICES</span> evaluated the management plan in 2009 and found it to 
 be in accordance with the PA. However, the <span class=\"caps\">SSB</span> index , being based 
 on lengths, excludes the problem connected with age estimation.</p>\n<p><span class=\"caps\">SSB
 </span> index is estimated to have decreased by more than 20% between the periods 2010–2012 
 (average of the three years) and 2013–2014 (average of the two years).</p>\n<p>A candidate 
 multispecies F<sub><span class=\"caps\">MSY</span></sub> was estimated.</p><pre><code><p>The 
 management plan, agreed October 2007 and implemented January 2008 was evaluated by <span 
 class=\"caps\">ICES</span> as to its accordance with the precautionary approach and reviewed 
 by three independent scientists.</p>\n<p>As the strong 2005 and 2006 year classes enter the 
 fishery discarding is expected to further increase, justifying the implementation of 
 measures to improve gear selectivity, such as increases in mesh size (<span class=\"caps\">ICES
 </span>, 2009a).</p></code></pre>\n"

Now I want to remove the text contained in the <pre><code> tags, but it fails:

irb(main):015:0> input.gsub(/<pre>.*<\/pre>/,'')
=> "<p><span class=\"caps\">ICES</span> evaluated the management plan in 2009 and found it
 to be in accordance with the PA. However, the <span class=\"caps\">SSB</span> index , being 
 based on lengths, excludes the problem connected with age estimation.</p>\n<p><span 
 class=\"caps\">SSB</span> index is estimated to have decreased by more than 20% between the 
 periods 2010–2012 (average of the three years) and 2013–2014 (average of the two years).</p>\n
 <p>A candidate multispecies F<sub><span class=\"caps\">MSY</span></sub> was estimated.</p><pre>
 <code><p>The management plan, agreed October 2007 and implemented January 2008 was evaluated 
 by <span class=\"caps\">ICES</span> as to its accordance with the precautionary approach 
 and reviewed by three independent scientists.</p>\n<p>As the strong 2005 and 2006 year classes 
 enter the fishery discarding is expected to further increase, justifying the implementation 
 of measures to improve gear selectivity, such as increases in mesh size (<span class=\"caps\">ICES</span>, 2009a).</p></code></pre>\n"

If I remove the newlines first, it works:

irb(main):016:0> input.gsub(/\n/,'').gsub(/<pre>.*<\/pre>/,'')
=> "<p><span class=\"caps\">ICES</span> evaluated the management plan in 2009 and found it 
 to be in accordance with the PA. However, the <span class=\"caps\">SSB</span> index , being 
 based on lengths, excludes the problem connected with age estimation.</p><p><span 
 class=\"caps\">SSB</span> index is estimated to have decreased by more than 20% between the 
 periods 2010–2012 (average of the three years) and 2013–2014 (average of the two years).</p>
 <p>A candidate multispecies F<sub><span class=\"caps\">MSY</span></sub> was estimated.</p>"

What am I missing?

2 answers:

Answer 0 (score: 2)

Try this:

input.gsub(/<pre>.*<\/pre>/m,'')

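The m switch puts the regular expression in multiline mode, where . also matches a newline. By default . matches any character except a newline, so .* cannot span the line breaks inside the <pre> block and the pattern never matches.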
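To see the difference on a smaller input, here is a minimal sketch. The sample string `html` is made up for illustration and is not the asker's data.

```ruby
# Hypothetical sample input with a newline inside the <pre> block.
html = "<p>keep</p><pre><code>line one\nline two</code></pre>\n"

# Without /m, `.` stops at "\n", so the pattern cannot reach </pre>.
without_m = html.gsub(/<pre>.*<\/pre>/, '')

# With /m, `.` also matches "\n", so the whole block is removed.
with_m = html.gsub(/<pre>.*<\/pre>/m, '')

p without_m == html  # true: nothing was replaced
p with_m             # "<p>keep</p>\n"
```

Note that gsub returns a new string and leaves `html` untouched; use gsub! if you want to modify the string in place.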

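Answer 1 (score: 0)

It's not clear what you want. Do you want to remove the text from the <pre><code> block, or do you want to remove the text and the wrapping tags?

This removes the content (text) inside the block: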

require 'nokogiri'

doc = Nokogiri::HTML(<<EOT)
<pre><code><p>foo</p></code></pre>
EOT

doc.search('pre code').each do |pc|
  pc.content = ''
end

puts doc.to_html 
# >> <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
# >> <html><body>
# >> <pre><code></code></pre>
# >> </body></html>

This removes the content and the <code> tag:

doc.search('pre code').each do |pc|
  pc.remove
end

puts doc.to_html 

# >> <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
# >> <html><body>
# >> <pre></pre>
# >> </body></html>

You can remove the <pre> tag instead, which also removes the <code> tag and its content:

doc.search('pre').each do |pc|
  pc.remove
end

puts doc.to_html        

# >> <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
# >> <html><body>
# >> </body></html>

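Except for very simple, trivial use cases, you should rely on a parser. gsub and regular expressions will lead you down a path that works until the HTML changes and your code blows up or, worse, silently does the wrong thing and returns wrong results.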
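As one illustration of how a regex can quietly return the wrong result, consider input with more than one <pre> block. This is only a sketch; the `html` string below is a hypothetical example, not the asker's data.

```ruby
# Hypothetical input: two <pre> blocks with a paragraph between them.
html = "<pre>a</pre><p>important</p><pre>b</pre>"

# Greedy `.*` runs from the first <pre> to the LAST </pre>,
# so the paragraph in between is deleted as well.
greedy = html.gsub(/<pre>.*<\/pre>/m, '')

# Non-greedy `.*?` stops at the nearest </pre>, which handles this case,
# but the pattern still misses tags with attributes such as <pre class="x">.
lazy = html.gsub(/<pre>.*?<\/pre>/m, '')

p greedy  # ""
p lazy    # "<p>important</p>"
```

Neither call raises an error, which is what makes this kind of failure easy to miss.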