Finding the mean difference by group for variables using data.table

Time: 2016-08-11 02:40:48

Tags: r data.table

Suppose I have the following data.table:

library(data.table)
dt <- data.table(x1 = c(1:12), x2=c(21:32))

Then I create bins from user-specified intervals using the following:

dt[,intx1:=cut(x1, breaks = c(-Inf, 4, 9, Inf))]

which returns:

    x1 x2    intx1
 1:  1 21 (-Inf,4]
 2:  2 22 (-Inf,4]
 3:  3 23 (-Inf,4]
 4:  4 24 (-Inf,4]
 5:  5 25    (4,9]
 6:  6 26    (4,9]
 7:  7 27    (4,9]
 8:  8 28    (4,9]
 9:  9 29    (4,9]
10: 10 30 (9, Inf]
11: 11 31 (9, Inf]
12: 12 32 (9, Inf]

I am trying to find, for each bin, the difference between the bin's mean and the overall mean of the variable:

dt[, mux1_grp:=mean(x1), by = intx1][,mux1_pop:=mean(x1)][,mux1_diff:=mux1_grp-mux1_pop]
dt[,`:=`(intx1=NULL, mux1_grp=NULL, mux1_pop=NULL)]

This returns:

    x1 x2 mux1_diff
 1:  1 21      -4.0
 2:  2 22      -4.0
 3:  3 23      -4.0
 4:  4 24      -4.0
 5:  5 25       0.5
 6:  6 26       0.5
 7:  7 27       0.5
 8:  8 28       0.5
 9:  9 29       0.5
10: 10 30       4.5
11: 11 31       4.5
12: 12 32       4.5

However, my original data contains several variables (e.g., x1, x2, ..., x20), so I have to repeat the same procedure for x2 as follows:

dt[,intx2:=cut(x2, breaks = c(-Inf, 25, 28, Inf))]
dt[, mux2_grp:=mean(x2), by = intx2][,mux2_pop:=mean(x2)][,mux2_diff:=mux2_grp-mux2_pop]
dt[,`:=`(intx2=NULL, mux2_grp=NULL, mux2_pop=NULL)]

My final output would be:

    x1 x2 mux1_diff mux2_diff
 1:  1 21      -4.0      -3.5
 2:  2 22      -4.0      -3.5
 3:  3 23      -4.0      -3.5
 4:  4 24      -4.0      -3.5
 5:  5 25       0.5      -3.5
 6:  6 26       0.5       0.5
 7:  7 27       0.5       0.5
 8:  8 28       0.5       0.5
 9:  9 29       0.5       4.0
10: 10 30       4.5       4.0
11: 11 31       4.5       4.0
12: 12 32       4.5       4.0

How can I improve this code? Note that each variable has its own user-specified intervals.

1 Answer:

Answer 0: (score: 2)

We can do this with a compact one-step option (though, per @Frank's comment, it may not be optimal compared with the OP's approach):

dt[, mu_diff := mean(x) - mean(dt$x), by = .(cut(x, breaks = c(-Inf, 4, 9, Inf)))][]
#     x    mu_diff
# 1:  1 -3.8636364
# 2:  2 -3.8636364
# 3:  3 -3.8636364
# 4:  4 -3.8636364
# 5:  5  0.3863636
# 6:  6  0.3863636
# 7:  7  0.3863636
# 8:  9  0.3863636
# 9: 10  4.6363636
#10: 11  4.6363636
#11: 12  4.6363636
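
As a quick sanity check, the one-step form can be compared against the OP's three-step form on the OP's own x1 column. This is only a sketch: the object name chk and its helper columns grp, mu_grp, mu_pop and mu_diff are made up here for the comparison and do not appear in the question.

```r
library(data.table)

# OP's example data
dt <- data.table(x1 = 1:12, x2 = 21:32)

# One-step version: group mean minus overall mean, grouped by the cut bins
dt[, mux1_diff := mean(x1) - mean(dt$x1),
   by = .(cut(x1, breaks = c(-Inf, 4, 9, Inf)))]

# OP's three-step version on a separate table (hypothetical names), for comparison
chk <- data.table(x1 = 1:12)
chk[, grp := cut(x1, breaks = c(-Inf, 4, 9, Inf))]
chk[, mu_grp := mean(x1), by = grp][, mu_pop := mean(x1)]
chk[, mu_diff := mu_grp - mu_pop]

# Both approaches should agree row by row
all.equal(dt$mux1_diff, chk$mu_diff)
```

Because the grouping expression is passed directly to by, no temporary bin column has to be created and dropped afterwards.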

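If there are many variables (it is not clear whether the same breaks are used in cut for every column or different ones; assume for now that they are the same), we can loop through the columns. In the reproducible example below, two variables, 'x1' and 'x2', are shown. We specify .SDcols by the numeric index of the column, group by the cut of that column, and assign the new column as the difference between the mean within the group and the mean of the whole column.

nm1 <- paste0("mu_diff", seq_along(dt1))
for(j in seq_along(dt1)){
  dt1[, (nm1[j]) := mean(.SD[[1L]]) - mean(dt1[[names(dt1)[j]]]),
      by = .(cut(get(names(dt1)[j]), breaks = c(-Inf, 4, 9, Inf))),
      .SDcols = j][]
}

Update: if the breaks passed to cut differ for each column, place them in a list and pick out the matching list element inside the for loop using the index.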

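# One break vector per column, in column order.
# dt1 should contain only x1 and x2 here (recreate it from the Data section first),
# because seq_along(dt1) iterates over every column currently in the table.
brkLst <- list(c(-Inf, 4, 9, Inf), c(-Inf, 10, 14, Inf))
for(j in seq_along(dt1)){
  dt1[, (nm1[j]) := mean(.SD[[1L]]) - mean(dt1[[names(dt1)[j]]]),
      by = .(cut(get(names(dt1)[j]), breaks = brkLst[[j]])),
      .SDcols = j][]
}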

Checking the output with the OP's new data ('dt2'):

brkLst2 <- list(c(-Inf, 4, 9, Inf),  c(-Inf, 25, 28, Inf))
nm1 <- paste0("mu", names(dt2), "_diff")
for(j in seq_along(dt2)){
   dt2[, (nm1[j]) := mean(.SD[[1L]]) - mean(dt2[[names(dt2)[j]]]), 
  by = .(cut(get(names(dt2)[j]), breaks = brkLst2[[j]])) ,
          .SDcols = j][]
}

dt2
#    x1 x2 mux1_diff mux2_diff
# 1:  1 21      -4.0      -3.5
# 2:  2 22      -4.0      -3.5
# 3:  3 23      -4.0      -3.5
# 4:  4 24      -4.0      -3.5
# 5:  5 25       0.5      -3.5
# 6:  6 26       0.5       0.5
# 7:  7 27       0.5       0.5
# 8:  8 28       0.5       0.5
# 9:  9 29       0.5       4.0
#10: 10 30       4.5       4.0
#11: 11 31       4.5       4.0
#12: 12 32       4.5       4.0
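
A variation on the loop above, offered only as a sketch: keying the breaks by column name instead of by position means the loop no longer depends on how many columns the table currently has. The names brks, bins and pop_mean are assumptions introduced for this illustration, not part of the original answer.

```r
library(data.table)

# OP's example data
dt <- data.table(x1 = 1:12, x2 = 21:32)

# Hypothetical named list: one break vector per column to process
brks <- list(x1 = c(-Inf, 4, 9, Inf),
             x2 = c(-Inf, 25, 28, Inf))

for (v in names(brks)) {
  bins     <- cut(dt[[v]], breaks = brks[[v]])  # bin factor for this column
  pop_mean <- mean(dt[[v]])                     # overall mean of this column
  # group mean minus overall mean, written to mu<column>_diff
  dt[, (paste0("mu", v, "_diff")) := mean(.SD[[1L]]) - pop_mean,
     by = .(bins), .SDcols = v]
}

dt
```

Only the columns listed in brks are touched, so extra columns in the table (including the newly added mu*_diff columns) are ignored if the loop is run again.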

Data

dt1 <- data.table(x1 = c(1,2,3,4,5,6,7,9,10,11,12))[, x2 := x1 + 5][]
#OP's changed dataset
dt2 <- data.table(x1 = 1:12, x2=21:32)