Ruby: group_by on an array of hashes

Time: 2017-06-02 03:07:39

Tags: ruby

I have an array of hashes representing compounds stored in boxes.

database = [{"Name"=>"Compound1", "Box"=>1},
            {"Name"=>"Compound2", "Box"=>1},
            {"Name"=>"Compound2", "Box"=>1},
            {"Name"=>"Compound3", "Box"=>1},
            {"Name"=>"Compound4", "Box"=>1},
            {"Name"=>"Compound5", "Box"=>2},
            {"Name"=>"Compound6", "Box"=>2},
            {"Name"=>"Compound1", "Box"=>3},
            {"Name"=>"Compound2", "Box"=>3},
            {"Name"=>"Compound3", "Box"=>3},
            {"Name"=>"Compound7", "Box"=>4}]

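I want to select a subset of the array that uses the smallest number of boxes while still covering the complete list of compounds (i.e. 1 to 7). The result would therefore be: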

database = [{"Name"=>"Compound1", "Box"=>1},
            {"Name"=>"Compound2", "Box"=>1},
            {"Name"=>"Compound3", "Box"=>1},
            {"Name"=>"Compound4", "Box"=>1},
            {"Name"=>"Compound5", "Box"=>2},
            {"Name"=>"Compound6", "Box"=>2},
            {"Name"=>"Compound7", "Box"=>4}]

I can group the compounds by box with:

database.group_by{|x| x['Box']}

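What I cannot work out is how to reduce the result so that duplicate compound names are removed from the grouped output.

4 answers:

Answer 0: (score: 2)

With Ruby >= 2.4 we can use transform_values: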

database.group_by { |hash| hash["Name"] }
        .transform_values { |v| v.min_by { |h| h["Box"] } }
        .values

Or, if you have Ruby < 2.4, you can do this:

database.group_by {|hash| hash["Name"] }.map { |_,v| v.min_by {|h| h["Box"]} }

Key methods: group_by, transform_values (Ruby >= 2.4) and min_by. See the Ruby docs for details.

Answer 1: (score: 1)

You can try Array#uniq:
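As a rough check, assuming a trimmed-down copy of the question's data (the variable name `sample` is made up for illustration), both variants should give the same array, with one entry per compound name taken from its lowest-numbered box:

```ruby
# Trimmed-down, hypothetical sample in the same shape as the question's data.
sample = [{"Name"=>"Compound1", "Box"=>3},
          {"Name"=>"Compound1", "Box"=>1},
          {"Name"=>"Compound2", "Box"=>1},
          {"Name"=>"Compound7", "Box"=>4}]

# Ruby >= 2.4: keep the entry with the lowest-numbered box for each name.
with_transform = sample.group_by { |hash| hash["Name"] }
                       .transform_values { |v| v.min_by { |h| h["Box"] } }
                       .values

# Older Rubies: the same idea expressed with map.
with_map = sample.group_by { |hash| hash["Name"] }
                 .map { |_, v| v.min_by { |h| h["Box"] } }

p with_transform
p with_transform == with_map
```

Note that this picks the lowest-numbered box per compound, which is not necessarily the same thing as using the fewest boxes overall.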

database = [{name: "Compound1", box: 1}, {name: "Compound2", box: 1}, {name: "Compound2", box: 1}, {name: "Compound3", box: 1}, {name: "Compound4", box: 1}, {name: "Compound5", box: 2}, {name: "Compound6", box: 2}, {name: "Compound1", box: 3}, {name: "Compound2", box: 3}, {name: "Compound3", box: 3}, {name: "Compound7", box: 4}]

p database.uniq { |h| h[:name] }
# =>  [
#   {:name=>"Compound1", :box=>1}, 
#   {:name=>"Compound2", :box=>1}, 
#   {:name=>"Compound3", :box=>1}, 
#   {:name=>"Compound4", :box=>1}, 
#   {:name=>"Compound5", :box=>2}, 
#   {:name=>"Compound6", :box=>2}, 
#   {:name=>"Compound7", :box=>4}
# ]

Or:

p database.group_by { |h| h[:box] }.each { |_box, group| group.uniq! { |h| h[:name] } }

# => {
#   1=>[
#     {:name=>"Compound1", :box=>1},
#     {:name=>"Compound2", :box=>1},
#     {:name=>"Compound3", :box=>1},
#     {:name=>"Compound4", :box=>1}
#   ], 
#   2=>[
#     {:name=>"Compound5", :box=>2}, 
#     {:name=>"Compound6", :box=>2}
#   ],
#   3=>[
#     {:name=>"Compound1", :box=>3},
#     {:name=>"Compound2", :box=>3},
#     {:name=>"Compound3", :box=>3}
#   ],
#   4=>[
#     {:name=>"Compound7", :box=>4}
#   ]
# }
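One thing to keep in mind: Array#uniq keeps the first element it sees for each key, so the result above depends on the order of the input. A minimal sketch, assuming you want the lowest-numbered box for each name regardless of input order (the `shuffled` data is invented for illustration), is to sort by box first:

```ruby
# Hypothetical input where a higher-numbered box appears first.
shuffled = [{name: "Compound1", box: 3},
            {name: "Compound1", box: 1},
            {name: "Compound2", box: 1}]

# uniq keeps the first element per name, so here Compound1 stays in box 3.
first_seen = shuffled.uniq { |h| h[:name] }
p first_seen

# Sorting by box first makes the lowest-numbered box win for every name.
lowest = shuffled.sort_by { |h| h[:box] }.uniq { |h| h[:name] }
p lowest
```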

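Answer 2: (score: 1)

The essence of the problem is to find a minimum-size combination of boxes that includes ("covers") all of a specified set of "components" (here, compounds). That combination of boxes is then used to compute the object of interest, as shown below.

Code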

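def min_box(database, coverage)
  # Map each box number to the array of compound names stored in it.
  boxes_to_compounds = database.each_with_object(Hash.new {|h,k| h[k]=[]}) { |g,h|
    h[g["Box"]] << g["Name"] }
  boxes = boxes_to_compounds.keys
  # Try combinations of 1 box, then 2, and so on up to all boxes (inclusive range).
  (1..boxes.size).each do |n|
    boxes.combination(n).each do |combo|
      # Return the first combination whose contents cover every required compound.
      return combo if (coverage - combo.flat_map { |box| boxes_to_compounds[box] }).empty?
    end
  end
  nil
end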

coverage is a given array of the required compounds (e.g. "Compound3").

Example

Suppose we are given database from the question and
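If the goal is to cover every compound that appears in the data, as in the question, coverage can be derived rather than typed out. A small sketch, assuming the same "Name" key (the shortened `sample_db` is only for illustration):

```ruby
# Hypothetical, shortened stand-in for the question's database.
sample_db = [{"Name"=>"Compound1", "Box"=>1},
             {"Name"=>"Compound2", "Box"=>1},
             {"Name"=>"Compound1", "Box"=>3}]

# Every distinct compound name, in order of first appearance.
derived_coverage = sample_db.map { |h| h["Name"] }.uniq
p derived_coverage
```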

coverage = ["Compound1", "Compound2", "Compound3", "Compound4",
            "Compound5", "Compound6", "Compound7"] 

The optimal combination of boxes is then found to be

combo = min_box(database, coverage)
  #=> [1, 2, 4]

We can now compute the required array of elements of database:

database.select { |h| combo.include?(h["Box"]) }.uniq
  #=> [{"Name"=>"Compound1", "Box"=>1}, {"Name"=>"Compound2", "Box"=>1},
  #    {"Name"=>"Compound3", "Box"=>1}, {"Name"=>"Compound4", "Box"=>1},
  #    {"Name"=>"Compound5", "Box"=>2}, {"Name"=>"Compound6", "Box"=>2},
  #    {"Name"=>"Compound7", "Box"=>4}] 

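Explanation of min_box

Finding the optimal combination of boxes is a hard problem (it is an instance of set cover, whose decision version is NP-complete). Some form of enumeration of box combinations is therefore needed. I first determine whether a single box provides the required coverage of components. If one does, an optimal solution has been found and the method returns an array containing that box. If no single box covers all the compounds, I look at all combinations of two boxes. If one of those combinations provides the required coverage, it is an optimal solution and the array of those boxes is returned; otherwise combinations of three boxes are considered, and so on. Eventually either an optimal combination is found, or it is concluded that all the boxes together do not provide the required coverage, in which case the method returns nil.

For the example above, the calculations are as follows.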
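The size-by-size search described above can be sketched on its own, under some assumptions: the box contents are already available as a hash from box to compound names, and the method name `smallest_cover` and the sample data are made up for illustration. It tries every combination size from 1 up to the total number of boxes, so the first hit is a smallest cover.

```ruby
require "set"

# Hypothetical helper: returns a smallest array of box ids whose combined
# contents include every compound in `wanted`, or nil if no such array exists.
def smallest_cover(contents, wanted)
  wanted = wanted.to_set
  boxes = contents.keys
  (1..boxes.size).each do |n|
    boxes.combination(n).each do |combo|
      covered = combo.flat_map { |box| contents[box] }.to_set
      return combo if wanted.subset?(covered)
    end
  end
  nil
end

# Invented sample: box id => compounds stored in that box.
contents = { 1 => %w[A B], 2 => %w[C], 3 => %w[A C] }

p smallest_cover(contents, %w[A B C])   #=> [1, 2]
p smallest_cover(contents, %w[A B C D]) #=> nil ("D" is in no box)
```

The number of combinations examined grows exponentially with the number of boxes, so a search like this is only practical for small inputs.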

boxes_to_compounds = database.each_with_object(Hash.new {|h,k| h[k]=[]}) { |g,h|
  h[g["Box"]] << g["Name"] }
  #=> {1=>["Compound1", "Compound2", "Compound2", "Compound3", "Compound4"],
  #    2=>["Compound5", "Compound6"],
  #    3=>["Compound1", "Compound2", "Compound3"],
  #    4=>["Compound7"]}
boxes = boxes_to_compounds.keys
  #=> [1, 2, 3, 4]
boxes.size
  #=> 4

Each value of n in the range is passed in turn to the outer each block. Consider n = 3:

n = 3
e = boxes.combination(n)
  #=> #<Enumerator: [1, 2, 3, 4]:combination(3)> 

We can see the objects that this enumerator will generate and pass to the inner each block by converting it to an array.

e.to_a
  #=> [[1, 2, 3], [1, 2, 4], [1, 3, 4], [2, 3, 4]] 

The first element generated by e is passed to the block, and the following is computed.

combo = e.next
  #=> [1, 2, 3]
a = combo.flat_map { |box| boxes_to_compounds[box] }
  #=> ["Compound1", "Compound2", "Compound2", "Compound3", "Compound4",
  #    "Compound5", "Compound6", "Compound1", "Compound2", "Compound3"] 
b = coverage - a  
  #=> ["Compound7"] 
b.empty?
  #=> false 

Since this combination of boxes does not include "Compound7", we press on and pass the next element generated by e to the block.

combo = e.next
  #=> [1, 2, 4] 
a = combo.flat_map { |box| boxes_to_compounds[box] }
  #=> ["Compound1", "Compound2", "Compound2", "Compound3", "Compound4",
  #    "Compound5", "Compound6", "Compound7"] 
b = coverage - a  
  #=> [] 
b.empty?
  #=> true 

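We have therefore found an optimal combination of boxes, [1, 2, 4], which the method returns.

Answer 3: (score: 0)

I don't like the original data structure. Why not start with a {CompoundX => BoxY} hash, since "Name" and "Box" aren't really useful? But if you're married to that structure, here's how I would do it: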

database = [{"Name"=>"Compound1", "Box"=>1},
            {"Name"=>"Compound2", "Box"=>1},
            {"Name"=>"Compound2", "Box"=>1},
            {"Name"=>"Compound3", "Box"=>1},
            {"Name"=>"Compound4", "Box"=>1},
            {"Name"=>"Compound5", "Box"=>2},
            {"Name"=>"Compound6", "Box"=>2},
            {"Name"=>"Compound1", "Box"=>3},
            {"Name"=>"Compound2", "Box"=>3},
            {"Name"=>"Compound3", "Box"=>3},
            {"Name"=>"Compound7", "Box"=>4}]

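# Flatten to compound, box, compound, box, ... with the "Name"/"Box" keys dropped.
# compact (not compact!) is used because compact! returns nil when nothing is removed.
new_db_arr = database.collect{|h| h.flatten}.flatten.collect{|i| i if i != "Name" && i != "Box"}.compact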
new_db_hash = {}
new_db_arr.each_slice(2) do |a,b|
  if new_db_hash[a].nil?
    new_db_hash[a] = []
  end
  new_db_hash[a] << b
end

new_db_hash
boxes = new_db_hash.values
combos = boxes[0].product(*boxes[1..-1])
combos = combos.sort_by{|a| a.uniq.length }
winning_combo = combos[0].uniq

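Most of the work is just converting the data structure into a hash of the form Compound => boxNumber. It then generates every combination of boxes (one box per compound), sorts the combinations by their number of unique items, and takes the one with the fewest unique items as the answer. I'm not sure how well this scales to very large data sets.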
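To get a feel for the scaling concern, one can count how many combinations product would generate before building them: it is the product of the per-compound list sizes. A rough sketch, assuming a compound-to-boxes hash like the one built above (the literal below is typed in by hand for illustration):

```ruby
# Hand-written stand-in for the compound => boxes hash.
compound_boxes = { "Compound1" => [1, 3],
                   "Compound2" => [1, 1, 3],
                   "Compound3" => [1, 3],
                   "Compound7" => [4] }

# Number of arrays that boxes[0].product(*boxes[1..-1]) would return.
combo_count = compound_boxes.values.map(&:size).reduce(1, :*)
p combo_count

# Removing repeated box numbers per compound shrinks the search space.
deduped_count = compound_boxes.values.map { |b| b.uniq.size }.reduce(1, :*)
p deduped_count
```

Because the count multiplies with every additional compound that lives in more than one box, it can grow very quickly on larger data sets.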