Ruby: split lines using optional tags that appear multiple times in a file, in any order

Time: 2017-06-06 15:46:09

Tags: ruby

I have a file with the following contents:

this is test line 1 
this is testing purpose 
<public>
am inside of public
doing lot of stuffs and priting result here
</public>
<public>
am inside of another public
doing another set of stuffs and priting here
</public>

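I want to split this file into three different parts:

  1. Lines that are not inside any section
  2. Lines inside the first section
  3. Lines inside the second section

I tried using take_while and drop_while: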

    File.open(filename).each_line.take_while do |l|
      !l.include?('</public>')
    end.drop_while do |l|
      !l.include?('<public>')
    end.drop(1)
    

But it only extracts the first <public> ... </public> section.
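
The reason only one section comes back is that take_while stops at the first closing tag, so nothing after it is ever examined. A minimal sketch of that behaviour on an in-memory array (the helper name `first_public_section` and the sample lines are assumptions for illustration, not part of the original attempt):

```ruby
# Hypothetical helper mirroring the take_while / drop_while chain above.
def first_public_section(lines)
  lines
    .take_while { |l| !l.include?('</public>') } # stops at the FIRST closing tag
    .drop_while { |l| !l.include?('<public>') }  # skips everything before the first opening tag
    .drop(1)                                     # drops the opening tag itself
end

# Stand-in for the file's lines (assumed sample, not read from disk).
sample = [
  "outside\n", "<public>\n", "first\n", "</public>\n",
  "<public>\n", "second\n", "</public>\n"
]

first_public_section(sample) # => ["first\n"]
```

The second section is never reached because take_while has already discarded everything from the first `</public>` onward.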

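In some cases the order may change: for example, the public sections may come first, with the remaining content at the end or in between. If the content always followed the order of the template above, I could use the approach below: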

    File.read(filename).split(/<\/?public>/)
                       .map(&:strip)
                       .reject(&:empty?)
    

I got this answer from "Split lines using tags that appear multiple times in file".
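
The order dependence of the split approach can be seen by running it on two orderings of the same content. This is only an illustrative sketch; the helper name `split_on_tags` and both sample strings are made up for the example:

```ruby
# Hypothetical wrapper around the split / strip / reject chain shown above.
def split_on_tags(text)
  text.split(/<\/?public>/).map(&:strip).reject(&:empty?)
end

outside_first = "outside\n<public>\ninside\n</public>\n"
inside_first  = "<public>\ninside\n</public>\noutside\n"

split_on_tags(outside_first) # => ["outside", "inside"]
split_on_tags(inside_first)  # => ["inside", "outside"]
```

Both results are flat arrays of strings, so nothing in the output says which chunk was inside a tag. That information only survives if the position of each chunk is known in advance.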

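But I am looking for a generic approach that handles the data regardless of the order.

I am looking for a better solution. Any suggestions would be appreciated.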

2 Answers:

Answer 0: (score: 1)

Consider this:

require 'nokogiri'

doc = Nokogiri::XML(<<EOT)
<root>
this is test line 1 
<public>
am inside of public
</public>
<public>
am inside of another public
</public>
</root>
EOT

text_inside_public_tags = doc.search('public').map(&:text)
# => ["\n" +
#    "am inside of public\n", "\n" +
#    "am inside of another public\n"]

doc.search('public').each(&:remove)

text_outside_public_tags = doc.at('root').text
# => "\n" +
#    "this is test line 1 \n" +
#    "\n" +
#    "\n"

Answer 1: (score: 1)

You can use the Ruby flip-flop operator here.

Code

def dissect(str)
  arr = str.lines.map(&:strip)
  grp, ungrp = [], []
  arr.each { |line| line=='<public>'..line=='</public>' ? (grp << line) : ungrp << line }
  [grp.slice_when { |s,t| s == '</public>' && t == '<public>' }.
       map { |a| a[1..-2] },
   ungrp]
end 

The last statement of the method, which constructs the array the method returns, could be replaced with the following.

b = grp.count('<public>').times.with_object([]) do |_,a|
  ndx = grp.index('</public>')
  a << grp[1..ndx-1]
  grp = grp[ndx+1..-1] if ndx < grp.size-1
end
[b, ungrp]

Example

str =<<-EOS
this is test line 1 
this is testing purpose 
<public>
am inside of public
doing lot of stuffs and printing result here
</public>
let's stick another line here
<public>
am inside of another public
doing another set of stuffs and printing here
</public>
and another line here
EOS

grouped, ungrouped = dissect(str)
  #=> [
  #     [ ["am inside of public",
  #        "doing lot of stuffs and printing result here"],
  #       ["am inside of another public",
  #        "doing another set of stuffs and printing here"]
  #     ],
  #     [
  #       "this is test line 1",
  #       "this is testing purpose",
  #       "let's stick another line here",
  #       "and another line here"
  #     ]
  #   ]
grouped
  #=> [ ["am inside of public",
  #      "doing lot of stuffs and printing result here"],
  #     ["am inside of another public",
  #      "doing another set of stuffs and printing here"]
  #   ]
ungrouped
  #=> ["this is test line 1",
  #    "this is testing purpose",
  #    "let's stick another line here",
  #    "and another line here"]

Explanation

For the example above, the steps are as follows.

arr = str.lines.map(&:strip)
  #=> ["this is test line 1", "this is testing purpose", "<public>",
  #    "am inside of public", "doing lot of stuffs and printing result here",
  #    "</public>", "let's stick another line here", "<public>",
  #    "am inside of another public", "doing another set of stuffs and printing here",
  #    "</public>", "and another line here"]

ungrp, grp = [], []
arr.each { |line| line=='<public>'..line=='</public>' ? (grp << line) : ungrp << line }

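The flip-flop returns false until line=='<public>' is true. It then returns true, and keeps returning true until after line=='</public>' is true. It then returns false until it again encounters a line for which line=='<public>' is true, and so on.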
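
The on/off behaviour described above can be traced line by line. The following is a small sketch; the helper name `flip_flop_trace` and the sample input are assumptions made for illustration:

```ruby
# Hypothetical helper: pairs each line with the flip-flop's value for that line.
def flip_flop_trace(lines)
  lines.map do |line|
    # In a ternary condition, `a..b` on boolean expressions is a flip-flop, not a Range.
    inside = (line == '<public>')..(line == '</public>') ? true : false
    [line, inside]
  end
end

flip_flop_trace(['a', '<public>', 'b', '</public>', 'c'])
# => [["a", false], ["<public>", true], ["b", true], ["</public>", true], ["c", false]]
```

Note that both tag lines themselves are reported as true, which is why the answer later trims them with a[1..-2].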

ungrp
  #=> <returns the value of 'ungrouped' in the example>
grp
  #=> ["<public>",
  #    "am inside of public",
  #    "doing lot of stuffs and printing result here",
  #    "</public>",
  #    "<public>",
  #    "am inside of another public",
  #    "doing another set of stuffs and printing here",
  #    "</public>"]
enum = grp.slice_when { |s,t| s == '</public>' && t == '<public>' }
  #=> #<Enumerator: #<Enumerator::Generator:0x00000

See Enumerable#slice_when, which made its debut in Ruby v2.2.
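
As a quick illustration of how slice_when decides where to cut, here is a sketch on plain integers. The helper name `consecutive_runs` is hypothetical; the block receives each adjacent pair and a new slice starts wherever it returns true:

```ruby
# Hypothetical helper: groups a list of integers into runs of consecutive values.
def consecutive_runs(numbers)
  # Start a new slice whenever the next number does not directly follow the previous one.
  numbers.slice_when { |prev, curr| curr != prev + 1 }.to_a
end

consecutive_runs([1, 2, 4, 5, 7])
# => [[1, 2], [4, 5], [7]]
```

In the answer's code the same mechanism cuts between a '</public>' element and the '<public>' element that immediately follows it.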

We can see the elements generated by this enumerator by converting it to an array.

enum.to_a
  #=> [["<public>", "am inside of public",
  #     "doing lot of stuffs and printing result here", "</public>"],
  #    ["<public>", "am inside of another public",
  #     "doing another set of stuffs and printing here", "</public>"]]

Finally,

enum.map { |a| a[1..-2] }
  #=> <returns the array 'grouped' in the example>
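
For comparison, the same partitioning can be sketched with an explicit state variable instead of the flip-flop. This is an alternative illustration, not the answer's method; the name `partition_sections` is hypothetical, and it assumes every <public> and </public> tag sits alone on its own line:

```ruby
# Hypothetical helper: returns [sections, outside_lines].
# Assumes each <public> / </public> tag sits alone on its own line.
def partition_sections(text)
  sections = []
  outside = []
  current = nil # lines of the section being read, or nil while outside any section

  text.each_line do |raw|
    line = raw.strip
    case line
    when '<public>'
      current = []                      # a new section begins
    when '</public>'
      sections << current if current    # close the section that was open
      current = nil
    else
      (current || outside) << line      # route the line to wherever we are now
    end
  end

  [sections, outside]
end

sections, outside = partition_sections(
  "<public>\nfirst\n</public>\nloose\n<public>\nsecond\n</public>\n"
)
# sections => [["first"], ["second"]]
# outside  => ["loose"]
```

Because each line is routed by the current state rather than by position, the result is the same whether the public sections come first, last, or in between.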