uniq for Enumerator::Lazy

Date: 2014-12-14 22:34:02

Tags: ruby

I'm working with data that has a lot of duplicate rows:

# => [ [1, "A", 23626], [1, "A", 31314], [2, "B", 2143], [2, "B", 5247] ]
puts xs

# => [ [1, "A"], [2, "B"] ]
puts xs.uniq{ |x| x[0] }.map{ |x| [x[0], x[1]] }

But xs is huge. I'm trying to load it lazily, yet Enumerator::Lazy has no uniq method.

How can I accomplish this lazily?
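A note not in the original question: Enumerator::Lazy gained a built-in uniq in Ruby 2.7, so on modern Rubies no custom code is needed. A guarded sketch (the respond_to? check keeps it safe on older versions):

```ruby
xs = [[1, "A", 23626], [1, "A", 31314], [2, "B", 2143], [2, "B", 5247]].lazy

# Enumerator::Lazy#uniq exists on Ruby 2.7+; guard for older Rubies.
if xs.respond_to?(:uniq)
  result = xs.uniq { |x| x[0] }.map { |x| [x[0], x[1]] }.to_a
  p result  # => [[1, "A"], [2, "B"]]
end
```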

3 answers:

Answer 0 (score: 7)

module EnumeratorLazyUniq
  refine Enumerator::Lazy do
    require 'set'
    def uniq
      set = Set.new
      select { |e|
        val = block_given? ? yield(e) : e
        !set.include?(val).tap { |exists|
          set << val unless exists
        }
      }
    end
  end
end

using EnumeratorLazyUniq
xs = [ [1, "A", 23626], [1, "A", 31314], [2, "B", 2143], [2, "B", 5247] ].to_enum.lazy

us = xs.uniq{ |x| x[0] }.map{ |x| [x[0], x[1]] }
puts us.to_a.inspect
# => [[1, "A"], [2, "B"]]
# Works with a block

puts us.class
# => Enumerator::Lazy
# Yep, still lazy.

ns = [1, 4, 6, 1, 2].to_enum.lazy
puts ns.uniq.to_a.inspect
# => [1, 4, 6, 2]
# Works without a block
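Not from the original answer, but the same filter can be written more tightly with Set#add?, which returns nil when the value is already present, so select keeps only first occurrences (EnumeratorLazyUniqAdd is an illustrative name):

```ruby
require 'set'

module EnumeratorLazyUniqAdd
  refine Enumerator::Lazy do
    def uniq
      seen = Set.new
      # Set#add? returns nil if the value was already in the set,
      # so select passes through only the first occurrence of each key.
      select { |e| seen.add?(block_given? ? yield(e) : e) }
    end
  end
end

using EnumeratorLazyUniqAdd

xs = [[1, "A", 23626], [1, "A", 31314], [2, "B", 2143]].lazy
result = xs.uniq { |x| x[0] }.map { |x| [x[0], x[1]] }.to_a
p result  # => [[1, "A"], [2, "B"]]
```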

This is a straightforward implementation using Set; note that the uniq'd values (i.e. whatever the block yields, such as 1 and 2 here, not the stream elements themselves, such as [1, "A", 23626]) will stay in memory.
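To make that memory point concrete, here is a small standalone sketch (no refinement) showing that only the keys accumulate in the set, not the full rows:

```ruby
require 'set'

seen = Set.new
rows = [[1, "A", 23626], [1, "A", 31314], [2, "B", 2143]].lazy
# Lazily keep the first row per key, then project to [key, label].
firsts = rows.select { |r| seen.add?(r[0]) }.map { |r| [r[0], r[1]] }.to_a

p firsts    # => [[1, "A"], [2, "B"]]
p seen.to_a # => [1, 2] -- only the keys were retained, not the rows
```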

Answer 1 (score: 2)

I decided to benchmark the two methods I suggested against @Amadan's. The results speak for themselves.

Benchmark code

require 'benchmark'

module EnumeratorLazyUniq
  refine Enumerator::Lazy do
    require 'set'
    def uniq
      set = Set.new
      select { |e|
        val = block_given? ? yield(e) : e
        !set.include?(val).tap { |exists|
          set << val unless exists
        }
      }
    end
  end
end

using EnumeratorLazyUniq

def amadan(xs)
  xs.uniq{ |x| x[0] }.map{ |x| [x[0], x[1]] }
end

require 'set'

def cary_set(arr)
  first = Set.new
  arr.each_with_object([]) do |(a0, a1, *_), b|
    unless first.include?(a0)
      first << a0 
      b << [a0, a1]
    end
  end
end

def cary_hash(arr)
  arr.each_with_object({}) { |(a0, a1, *_), h|
    h[a0]=[a0, a1] unless h.key?(a0) }.values
end

Test data

n_uniq = 10_000
n_copies = 100
tot = n_uniq * n_copies

xs = tot.times.map { |i| [i % n_uniq, 0, 1] }

Running the benchmark

Benchmark.bm do |x|
  x.report("cary_set ") { cary_set(xs)  }
  x.report("cary_hash") { cary_hash(xs) }
  x.report("amadan   ") { amadan(xs)    }
end
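The results below cover several (unique elements, copies) configurations, so presumably the data generation and report were repeated per run. A scaled-down driver in that spirit (the configuration values and the single-method report here are illustrative, not the original ones):

```ruby
require 'benchmark'

def cary_hash(arr)
  arr.each_with_object({}) { |(a0, a1, *_), h|
    h[a0] = [a0, a1] unless h.key?(a0) }.values
end

# Small illustrative configurations; the original runs used up to 10M rows.
[[1_000, 5], [500, 10]].each do |n_uniq, n_copies|
  tot = n_uniq * n_copies
  xs = tot.times.map { |i| [i % n_uniq, 0, 1] }
  puts "Unique elements: #{n_uniq}"
  puts "Number of copies of each unique element: #{n_copies}"
  puts "Array size: #{tot}"
  Benchmark.bm { |x| x.report("cary_hash") { cary_hash(xs) } }
end
```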

Results

Unique elements: 200,000
Number of copies of each unique element: 5
Array size: 1,000,000

       user     system      total        real
cary_set   0.980000   0.030000   1.010000 (  1.018618)
cary_hash  0.980000   0.010000   0.990000 (  0.982508)
amadan     0.590000   0.010000   0.600000 (  0.597249)


Unique elements: 100,000
Number of copies of each unique element: 10
Array size: 1,000,000

       user     system      total        real
cary_set   0.920000   0.030000   0.950000 (  0.942539)
cary_hash  0.630000   0.020000   0.650000 (  0.642367)
amadan     0.470000   0.000000   0.470000 (  0.478658)


Unique elements: 50,000
Number of copies of each unique element: 20
Array size: 1,000,000

       user     system      total        real
cary_set   0.910000   0.020000   0.930000 (  0.932277)
cary_hash  0.570000   0.000000   0.570000 (  0.575439)
amadan     0.410000   0.010000   0.420000 (  0.417695)

Unique elements: 1,000,000
Number of copies of each unique element: 10
Array size: 10,000,000

       user     system      total        real
cary_set  12.660000   0.270000  12.930000 ( 12.962183)
cary_hash  7.730000   0.060000   7.790000 (  7.797486)
amadan     6.640000   0.060000   6.700000 (  6.707706)

Answer 2 (score: 1)

Why not do it simply?

Code

require 'set'

def extract(arr)
  first = Set.new
  arr.each_with_object([]) do |(a0, a1, *_), b|
    unless first.include?(a0)
      first << a0 
      b << [a0, a1]
    end
  end
end

Example

arr = [ [1, "A", 23626], [1, "A", 31314], [2, "B", 2143], [2, "B", 5247] ]
extract(arr)
  #=> [[1, "A"], [2, "B"]]

Alternative

One variation is:

def extract(arr)
  arr.each_with_object({}) { |(a0, a1, *_), h|
    h[a0]=[a0, a1] unless h.key?(a0) }.values
end

I would expect the performance to be about the same, but the hash uses more memory because of the values it stores.
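Applied to the question's data, this variant gives the same result as the Set version (a sketch; extract is renamed extract_hash here so the two definitions can coexist):

```ruby
def extract_hash(arr)
  # First occurrence of each key wins; Hash#values preserves insertion order.
  arr.each_with_object({}) { |(a0, a1, *_), h|
    h[a0] = [a0, a1] unless h.key?(a0) }.values
end

arr = [[1, "A", 23626], [1, "A", 31314], [2, "B", 2143], [2, "B", 5247]]
p extract_hash(arr)  # => [[1, "A"], [2, "B"]]
```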