Question

I have an Array of String objects in Ruby, each made up of words like the ones below:

animals = ["cat horse", "dog", "cat dog bird", "dog sheep", "chicken cow"]

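I want to convert this into another Array of String objects, but with only one animal per element and only unique elements. I found one way to do it, shown below: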

class Array
  def process()
    self.join(" ").split().uniq
  end
end

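However, if the input array is huge, say millions of entries, this will perform very badly, because I create one huge string, then one huge array, and then uniq has to process that huge array to remove the duplicates. One way I thought of to speed this up is to build a Hash with one entry per word, so that I only handle each word once in a single pass. Is there a better way?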

Answer 1

You have the right idea. However, Ruby has a built-in class that is a perfect fit for building a collection of unique items: Set.

require "set" # not needed on Ruby 3.2 and later

animals = ["cat horse", "dog", "cat dog bird", "dog sheep", "chicken cow"]

unique_animals = Set.new

animals.each do |str|
  unique_animals.merge(str.split)
end

unique_animals.to_a
# => ["cat", "horse", "dog", "bird", "sheep", "chicken", "cow"]

Or...

unique_animals = animals.reduce(Set.new) do |set, str|
  set.merge(str.split)
end

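Under the hood, Set actually uses a Hash to store its items, but it behaves more like an unordered array and responds to all the familiar Enumerable methods (each, map, etc.). If you need to turn it into a real array, though, use Set#to_a.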
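
As a minimal sketch of the point above, the snippet below builds the set and then calls a few Enumerable methods on it before converting it to an array. The variable names (upcased, animal_list) are illustrative and not part of the original answer.

```ruby
require "set" # not needed on Ruby 3.2 and later

animals = ["cat horse", "dog", "cat dog bird", "dog sheep", "chicken cow"]

# Build the set one input string at a time; duplicates are dropped on insert.
unique_animals = animals.each_with_object(Set.new) do |str, set|
  set.merge(str.split)
end

# Set includes Enumerable, so methods such as map and include? work directly.
upcased  = unique_animals.map(&:upcase)
has_bird = unique_animals.include?("bird")

# Convert to a real Array when array-only methods (indexing, sort!, ...) are needed.
animal_list = unique_animals.to_a
```

Here map returns a plain Array, while to_a is the explicit conversion when the set itself is no longer needed.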

Answer 2

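Surprisingly (perhaps), I don't think you will get much faster than your current code. I think your code is both the fastest and the most readable. Here is why: your code expresses a very nice high-level algorithm that maps directly onto Ruby's high-level methods. Those methods are optimized and compiled. Good luck getting anything faster in pure Ruby. In any case, I'm no Ruby guru, and I'd be very interested to see a more efficient solution on reasonably sized arrays.

Jordan and Nathaniel implemented more elaborate solutions that process the input array "manually" and iteratively. While these may use less memory, they won't be as fast as Ruby's uniq. However, if you run into memory problems with large arrays (or hit performance problems past some threshold), you should certainly consider implementing a variant of these. Here is mine: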

def process
  distincts = Hash.new
  self.each { |words| words.split.each { |word| distincts[word] = nil }}
  distincts.keys
end

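This is Jordan's solution, using a Hash instead of a Set, which is what you were planning to use. Using a Hash directly should eliminate the overhead of maintaining a Set (or so I think) and should be noticeably faster. A slightly faster solution might be: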

def process
  distincts = Hash.new
  self.each { |words| words.split.each { |word| distincts[word] = :present unless distincts[word] }}
  distincts.keys
end

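Again, I'm not sure (sorry, I can't easily test all of this right now). In any case, I suspect one of these two comes closer to the performance of the original code, but I doubt it will beat it (again, until you reach a certain input size).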
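
Since the comparison above is untested, one way to check it is a small harness built on Ruby's standard Benchmark module. This is only a sketch: the method names (with_join, with_hash, with_set), the generated input, and its size are assumptions made for illustration, and the timings will depend on the Ruby version and on the data.

```ruby
require "benchmark"
require "set" # not needed on Ruby 3.2 and later

# Hypothetical test data: many short strings drawn from a small vocabulary.
WORDS = %w[cat horse dog bird sheep chicken cow].freeze
rng = Random.new(42)
input = Array.new(50_000) do
  Array.new(rng.rand(1..3)) { WORDS.sample(random: rng) }.join(" ")
end

# The original approach from the question.
def with_join(input)
  input.join(" ").split.uniq
end

# The Hash-based variant from this answer.
def with_hash(input)
  distincts = {}
  input.each { |words| words.split.each { |word| distincts[word] = nil } }
  distincts.keys
end

# The Set-based variant from Answer 1.
def with_set(input)
  input.each_with_object(Set.new) { |str, set| set.merge(str.split) }.to_a
end

Benchmark.bm(10) do |x|
  x.report("join/uniq") { with_join(input) }
  x.report("hash")      { with_hash(input) }
  x.report("set")       { with_set(input) }
end
```

All three methods return the words in first-occurrence order, so their results can be compared directly before trusting the timings.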

Answer 3

Why not process each array element on its own?

for each element in [...]
  if the element does not contain spaces
    insert it into the result array
  else
    split it up and insert its parts in the next position ahead
  end
end

Here is a Ruby implementation:

class Array
  def process
    d = dup
    d.each_with_object([]).each_with_index do |(element, array), index|
      if !element.index " "
        array << element if !array.include? element
      else
        d.insert index+1, *(element.split)
      end
    end
  end
end

["cat horse", "dog", "cat dog bird", "dog sheep", "chicken cow"].process
=> ["cat", "horse", "dog", "bird", "sheep", "chicken", "cow"]

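Pros:

You don't have to process one long string
As close to linear time as possible (see the cons)
Element order is preserved

Cons:

Slightly slower than linear time (because strings are split and inserted ahead)

That said, it is much faster than join(" ").split().uniq (fewer loops). But it is faster in a practical sense rather than a scientific one.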
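
For comparison, here is a sketch of the same "handle each element on its own" idea with one swapped-in technique: a Hash records which words have already been seen, so the membership check does not scan the result array. This is not the implementation above; the helper name unique_words_in_order is hypothetical.

```ruby
# Hypothetical helper: walks the input once and keeps first-occurrence order.
def unique_words_in_order(strings)
  seen = {} # word => true once it has been added to the result
  strings.each_with_object([]) do |element, result|
    element.split.each do |word|
      next if seen[word]

      seen[word] = true
      result << word
    end
  end
end

animals = ["cat horse", "dog", "cat dog bird", "dog sheep", "chicken cow"]
ordered = unique_words_in_order(animals)
# => ["cat", "horse", "dog", "bird", "sheep", "chicken", "cow"]
```

It takes the array as an argument rather than reopening Array, and it never inserts into the array being iterated.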

Answer 4

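I tried the various approaches others have suggested here, and I came up with two that are faster than those suggestions but still not as fast as the original.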

  # This one moves through the original Array using inject to process
  # each element containing space-separated words and appending them
  # to a new array.  Finally uniq is called to remove duplicate words
  def process_new_4
    self.inject([]) {
        |array, words|
      array.push(*words.split)
    }.uniq
  end

  # This one uses the flat_map method of Array to flatten itself, each
  # element is split in case it contains more than one word, then the
  # flattened array has duplicate elements removed with uniq
  def process_new_3
    self.flat_map(&:split).uniq
  end
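
The two methods above rely on self, so they are presumably meant to live inside class Array. As a small usage sketch, the flat_map variant can also be written as a standalone method that takes the array as an argument; the name unique_words is hypothetical.

```ruby
# Hypothetical standalone version of process_new_3.
def unique_words(strings)
  # Split every element into words, flatten one level, then drop duplicates.
  strings.flat_map(&:split).uniq
end

animals = ["cat horse", "dog", "cat dog bird", "dog sheep", "chicken cow"]
result = unique_words(animals)
# => ["cat", "horse", "dog", "bird", "sheep", "chicken", "cow"]
```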

Can I make this Ruby code faster and/or use less memory?
