Question

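Background

I am fairly new to the data.table library and am currently learning to use it efficiently. I have two tables here: I want to aggregate the second one first, then join it with the first one and modify a column in the joined table. Ideally (and for the sake of my understanding) all in one go.

Package versions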

sessionInfo()
# R version 3.1.0 (2014-04-10)
# Platform: i386-w64-mingw32/i386 (32-bit)

# attached base packages:
# [1] stats     graphics  grDevices utils     datasets  methods   base     

# other attached packages:
# [1] data.table_1.9.4

# loaded via a namespace (and not attached):
# [1] chron_2.3-45  plyr_1.8.1    Rcpp_0.11.2   reshape2_1.4  stringr_0.6.2
# [6] tools_3.1.0

Code

This minimal example shows what I have tried:

library(data.table)
set.seed(1)
DT1 <- data.table(id = LETTERS[1:4], x = rnorm(4), key = "id")
DT2 <- data.table(id = rep(LETTERS[1:4], each = 3), y = 1:12, z = rep(1, 12), key = "id")
DT1[DT2[, lapply(.SD, mean), by = "id"]] # simple join works fine
#    id         x  y z
# 1:  A -0.6264538  2 1
# 2:  B  0.1836433  5 1
# 3:  C -0.8356286  8 1
# 4:  D  1.5952808 11 1

# however, adding a 'j' argument does not work
DT1[DT2[, lapply(.SD, mean), by = "id"], x := -x] # (1)

# in fact the above statement changes the 'x' column in 'DT1':
DT1
#    id          x
# 1:  A  0.6264538
# 2:  B -0.1836433
# 3:  C  0.8356286
# 4:  D -1.5952808

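I guess this has something to do with the smart way data.table handles data (nothing is copied unless needed, hence updates happen by reference). Accordingly, the following code works: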

DT3 <- copy(DT1[DT2[, lapply(.SD, mean), by = "id"]])[, x := -x]
(DT4 <- DT1[DT2[, lapply(.SD, mean), by = "id"]][, x := -x]) # (2)
#    id          x  y z
# 1:  A -0.6264538  2 1
# 2:  B  0.1836433  5 1
# 3:  C -0.8356286  8 1
# 4:  D  1.5952808 11 1
identical(DT3, DT4)
# [1] TRUE

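Questions

What is the 'best' way of doing this? 'Best' in terms of time and memory used.
What is the conceptual way of doing this? In other words, what sequence of commands would Matt Dowle (the package maintainer) use?
Why does (1) not work while (2) works as expected?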

Answer 1

The problem with your current implementation (1)

DT1[DT2[, lapply(.SD, mean), by = "id"], x := -x] # (1)

Here you are modifying DT1 by reference with x := -x; the result of the join with DT2[, ...] is never actually assigned anywhere.
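To make this concrete, here is a minimal sketch using the tables from the question. The names `agg`, `before` and `res` are mine, introduced only for illustration. It checks that statement (1) negates `x` inside DT1 itself and that what comes back still has only DT1's columns.

```r
library(data.table)

set.seed(1)
DT1 <- data.table(id = LETTERS[1:4], x = rnorm(4), key = "id")
DT2 <- data.table(id = rep(LETTERS[1:4], each = 3), y = 1:12, z = rep(1, 12), key = "id")

agg    <- DT2[, lapply(.SD, mean), by = "id"]  # aggregated DT2 (hypothetical helper name)
before <- copy(DT1)                            # deep copy, so it is unaffected by := below

res <- DT1[agg, x := -x]                       # statement (1)

names(res)                        # only DT1's columns; y and z from agg are not attached
all.equal(DT1$x, -before$x)       # DT1 itself now holds the negated values
```

In other words, with `:=` in `j` the join only selects which rows of DT1 to update; it does not build a joined table.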

What you want is (2)

 DT3 <- DT1[DT2[, lapply(.SD, mean), by = "id"]][, x := -x]

Here, the additional call to [ on the joined dataset means that x := -x is assigned in the newly created data.table.
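A small sketch of the chained form, again on the question's data (the names `before` and `joined` are mine). It checks that the join result carries the aggregated columns and the negated `x`, while DT1 keeps its original values.

```r
library(data.table)

set.seed(1)
DT1 <- data.table(id = LETTERS[1:4], x = rnorm(4), key = "id")
DT2 <- data.table(id = rep(LETTERS[1:4], each = 3), y = 1:12, z = rep(1, 12), key = "id")

before <- copy(DT1)   # deep copy used only for the comparison below

# The first [ builds a new joined table; the second [ updates that new table by reference
joined <- DT1[DT2[, lapply(.SD, mean), by = "id"]][, x := -x]

names(joined)                      # id, x, y, z
all.equal(joined$x, -before$x)     # x is negated in the joined table
all.equal(DT1$x, before$x)         # DT1 is left as it was
```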

There is no need for an explicit copy unless you really need one.
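As an aside, and not something the question asked for in this form: if the goal is a single call without `:=`, one possible variation is to build the result columns directly in `j` during the join. This is only a sketch under the assumption that both tables are keyed on `id` as in the question; `agg` and `one_step` are names I made up.

```r
library(data.table)

set.seed(1)
DT1 <- data.table(id = LETTERS[1:4], x = rnorm(4), key = "id")
DT2 <- data.table(id = rep(LETTERS[1:4], each = 3), y = 1:12, z = rep(1, 12), key = "id")

agg <- DT2[, lapply(.SD, mean), by = "id"]   # per-id means of y and z

# j is evaluated on the joined rows, so columns of both DT1 (x) and agg (y, z) are visible;
# no := is involved, so DT1 is not modified
one_step <- DT1[agg, .(id, x = -x, y, z)]

names(one_step)   # id, x, y, z
```

Whether this is preferable to the chained form is a matter of taste; the chained version is arguably easier to read.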

Answer 2

Here is how I would solve this problem with dplyr:

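library("dplyr")

set.seed(1)
# tibble() has no 'key' argument; key = "id" would just create a column named "key"
DT1 <- tibble(id = LETTERS[1:4], x = rnorm(4))
DT2 <- tibble(id = rep(LETTERS[1:4], each = 3), y = 1:12, z = rep(1, 12))

DT2 %>% 
  group_by(id) %>% 
  summarise(across(y:z, mean)) %>%  # per-id means; across() assumes dplyr >= 1.0
  left_join(DT1, by = "id") %>%     # join the aggregate with DT1 on id
  mutate(x = -x)                    # negate x in the joined result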

(Assuming I have interpreted your data.table code correctly.)
