Merging rows with equal and unequal data

Time: 2015-04-21 13:58:47

Tags: r reshape

I'm struggling to merge some messy data.

I have a data frame like this:

df <- data.frame(name = c("A", "A", "B", "B", "C", "C"), 
                 number = c(1, 1, 2, 2, 3, 3), 
                 product = c("fixed", "variable", "aggregate", "variable", "fixed", "fixed"), 
                 vol = c(1, 9, 2, 6, 4, 7)
                 )

This is the result I'm aiming for:

result <- data.frame(name = c("A", "B", "C"), 
                     number = c(1, 2, 3), 
                     new_product = c("fixed variable", "aggregate variable", "fixed"), 
                     vol = c(10, 8, 11) 
                     )

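My problem is that I need to collapse all rows of the data frame that share the same name and number into one row, summing vol. Where the product values within a group differ, I need to combine them into a single name, as in new_product in the result above.

I have tried dplyr, but I can't get new_product in any meaningful way, because I can't refer to the same column again.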

df %>% 
  group_by(name) %>% 
  summarize(name = name, 
            number = number, 
            newproduct = paste(product, product)) # ????

Any help is greatly appreciated!

4 answers:

Answer 0 (score: 7)

Here is how I would solve this with data.table, although I'm not sure how you define number:

library(data.table)
result <- setDT(df)[,.(new_product = toString(unique(product)), vol = sum(vol)), by = name]
result[, number := .I]
result
#    name         new_product vol number
# 1:    A     fixed, variable  10      1
# 2:    B aggregate, variable   8      2
# 3:    C               fixed  11      3

Note: if you prefer space-separated output, as in your desired result, you can use paste(unique(product), collapse = " ") instead of toString.
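To make the difference between the two concrete, here is a small sketch on a made-up vector (the values are only for illustration): toString joins with a comma and a space, while paste with collapse = " " joins with a single space.

```r
# Illustrative vector with one repeated value
product <- c("fixed", "variable", "fixed")

toString(unique(product))               # "fixed, variable"
paste(unique(product), collapse = " ")  # "fixed variable"
```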

The same aggregation with dplyr:

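library(dplyr)

df %>% 
  group_by(name) %>% 
  # one row per name: distinct products joined, volumes summed
  summarise(new_product = toString(unique(product)), vol = sum(vol)) %>% 
  # number is regenerated as a simple row index
  mutate(number = row_number())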
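If number should be carried over from the data rather than regenerated as a row index, one option is to group on both name and number. The following is only a minimal base R sketch, assuming number is constant within each name (an assumption that holds for the sample data, but the question does not state it):

```r
df <- data.frame(
  name    = c("A", "A", "B", "B", "C", "C"),
  number  = c(1, 1, 2, 2, 3, 3),
  product = c("fixed", "variable", "aggregate", "variable", "fixed", "fixed"),
  vol     = c(1, 9, 2, 6, 4, 7),
  stringsAsFactors = FALSE
)

# Group on both keys so that number comes from the data itself
products <- aggregate(product ~ name + number, data = df,
                      FUN = function(x) paste(unique(x), collapse = " "))
volumes  <- aggregate(vol ~ name + number, data = df, FUN = sum)

# Join the two summaries back together on the grouping keys
result <- merge(products, volumes, by = c("name", "number"))
names(result)[names(result) == "product"] <- "new_product"
result
```

Two aggregate calls are used because a single call applies one function to every column; merge then lines the two summaries up by name and number.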

Answer 1 (score: 3)

Here are two more ways, in pure base R:

df <- data.frame(name = c("A", "A", "B", "B", "C", "C"), 
                 number = rep(1:3, each = 2), 
                 product = c("fixed", "variable", "aggregate", "variable", "fixed", "fixed"), 
                 vol = c(1, 9, 2, 6, 4, 7)
)

  1. This one just uses ave to operate on the original data frame and then drops the duplicated rows:

    within(df, {
      # ave works on row indices here so each group can look up its products
      new_product <- ave(seq_along(name), name, FUN = function(x) 
        toString(unique(df[x, 'product'])))
      vol <- ave(vol, name, FUN = sum)
      product <- NULL
    })[!duplicated(df$name), ]
    
    #   name number vol         new_product
    # 1    A      1  10     fixed, variable
    # 3    B      2   8 aggregate, variable
    # 5    C      3  11               fixed

  2. This is a more roundabout way: build new_product with aggregate, match it back to the original data, then use aggregate again to get the sum by group:

    (tmp <- aggregate(product ~ name, df, function(x)
      paste0(unique(x), collapse = ' ')))
    #   name            product
    # 1    A     fixed variable
    # 2    B aggregate variable
    # 3    C              fixed
    
    df$new_product <- tmp[match(df$name, tmp$name), 'product']
    res <- aggregate(vol ~ name + new_product, df, sum)
    within(res[order(res$name), ], {
      number <- seq_len(nrow(res))
    })
    
    #   name        new_product vol number
    # 3    A     fixed variable  10      1
    # 1    B aggregate variable   8      2
    # 2    C              fixed  11      3
      

Answer 2 (score: 2)

Others have already answered, but here is my solution:

df %>% 
  group_by(name) %>%
  summarise(
    new_product = paste(unique(product), collapse = " "),
    vol = sum(vol)) %>%
  mutate(number = row_number()) %>%
  select(name, number, new_product, vol)

Answer 3 (score: 1)

Base R with a bit of curry:

library(functional)

aggregStrFunc = Compose(unique, Curry(paste, collapse=','))

setNames(cbind(
    aggregate(df$vol, by=list(name=df$name), sum),
    aggregate(df$product, by=list(df$name), aggregStrFunc)[-1]
), c('Name', 'Vol', 'New_Product'))

#  Name Vol        New_Product
#1    A  10     fixed,variable
#2    B   8 aggregate,variable
#3    C  11              fixed
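
For readers without the functional package, the composed function can probably be replaced by a plain anonymous function. This is only a sketch: collapse_unique is a hypothetical name introduced here, and it assumes that Compose applies unique first and paste second, which is what the output above suggests.

```r
df <- data.frame(
  name    = c("A", "A", "B", "B", "C", "C"),
  product = c("fixed", "variable", "aggregate", "variable", "fixed", "fixed"),
  vol     = c(1, 9, 2, 6, 4, 7),
  stringsAsFactors = FALSE
)

# Hypothetical stand-in for Compose(unique, Curry(paste, collapse = ",")):
# deduplicate first, then collapse into one comma-separated string
collapse_unique <- function(x) paste(unique(x), collapse = ",")

res <- setNames(cbind(
  aggregate(df$vol, by = list(name = df$name), FUN = sum),
  aggregate(df$product, by = list(df$name), FUN = collapse_unique)[-1]
), c("Name", "Vol", "New_Product"))
res
```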