Subsetting a data frame based on unique values and data in other columns

Time: 2017-09-25 18:00:44

Tags: r dataframe data.table

I have a data frame with several ID columns (trt, individual, and session):

trt<-c(rep("A",3),rep("B",3),rep("C",3),rep("A",3),rep("B",3),rep("C",3),rep("A",3),rep("B",3),rep("C",3))
individual<-rep(c("Bob","Nancy","Tim"),9)
session<-c(1,1,1,2,2,2,3,3,3,4,4,4,5,5,5,6,6,6,7,7,7,8,8,8,9,9,9)
data<-rnorm(27,mean=4,sd=1)
df<-as.data.frame(cbind(trt,individual,session,data))
df
   trt individual session             data
1    A        Bob       1 4.36604594311893
2    A      Nancy       1 3.29568979189961
3    A        Tim       1 3.55849387209243
4    B        Bob       2 5.41661201729216
5    B      Nancy       2  4.7158873476798
6    B        Tim       2 5.34401708530548
7    C        Bob       3 4.54277206331273
8    C      Nancy       3 3.53976115781019
9    C        Tim       3  3.7954788384957
10   A        Bob       4 4.75145309337952
11   A      Nancy       4  4.7995601464568
12   A        Tim       4 3.17821205815185
13   B        Bob       5 3.62379779744325
14   B      Nancy       5 4.07387328854209
15   B        Tim       5 5.60156909861945
16   C        Bob       6 4.06727142161431
17   C      Nancy       6 4.59940289933985
18   C        Tim       6 3.07543217234973
19   A        Bob       7 2.63468285023662
20   A      Nancy       7 3.22650587327078
21   A        Tim       7 6.31062631711196
22   B        Bob       8 4.69047076193906
23   B      Nancy       8 4.79190101388308
24   B        Tim       8 1.61906440409175
25   C        Bob       9 2.85180524036416
26   C      Nancy       9 3.43304058627408
27   C        Tim       9 4.89263600498695

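I would like to create a new data frame by randomly sampling one row for each trt x individual combination, subject to the constraint that each unique session number is selected only once.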

This is the kind of data frame I want to end up with:

    trt individual session             data
2    A      Nancy       1 3.29568979189961
4    B        Bob       2 5.41661201729216
9    C        Tim       3  3.7954788384957
10   A        Bob       4 4.75145309337952
15   B        Tim       5 5.60156909861945
17   C      Nancy       6 4.59940289933985
21   A        Tim       7 6.31062631711196
23   B      Nancy       8 4.79190101388308
25   C        Bob       9 2.85180524036416
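
The two requirements can be written down as a small check, which makes it easy to test any candidate result. This is only a sketch in base R: `is_valid_draw`, `full`, `wanted`, and `rejected` are hypothetical names introduced here, and `full` rebuilds just the ID columns of the example, since the `data` values do not affect validity.

```r
# Hypothetical helper: TRUE if `x` holds exactly one row for every
# trt x individual combination found in `full`, and no session appears twice.
is_valid_draw <- function(x, full) {
  combos_x    <- paste(x$trt, x$individual)
  combos_full <- unique(paste(full$trt, full$individual))
  !anyDuplicated(combos_x) &&
    setequal(combos_x, combos_full) &&
    !anyDuplicated(x$session)
}

# ID columns of the example, in the same row order as the printed data frame.
full <- data.frame(
  trt        = rep(rep(c("A", "B", "C"), each = 3), times = 3),
  individual = rep(c("Bob", "Nancy", "Tim"), times = 9),
  session    = rep(1:9, each = 3)
)

# Row numbers of the desired result shown above.
wanted <- full[c(2, 4, 9, 10, 15, 17, 21, 23, 25), ]
is_valid_draw(wanted, full)

# A draw that uses session 2 twice (rows 5 and 6) fails the check.
rejected <- full[c(10, 2, 21, 22, 5, 6, 16, 26, 9), ]
is_valid_draw(rejected, full)
```

The helper reads columns with `$`, so it should accept either a plain data frame or a data.table.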

I know how to randomly select one row for each trt x individual combination:

library(data.table)
setDT(df)
newdf<-df[, .SD[sample(.N, 1)] , by=.(trt, individual)]
newdf
  trt individual session             data
1:   A        Bob       4 4.75145309337952
2:   A      Nancy       1 3.29568979189961
3:   A        Tim       7 6.31062631711196
4:   B        Bob       8 4.69047076193906
5:   B      Nancy       **2**  4.7158873476798
6:   B        Tim       **2** 5.34401708530548
7:   C        Bob       6 4.06727142161431
8:   C      Nancy       9 3.43304058627408
9:   C        Tim       3  3.7954788384957

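But I don't know how to constrain the draw so that each session is picked only once (that is, no repeated sessions like the ones highlighted above).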

Thanks in advance for your help!

1 Answer:

Answer 0: (score: 1)

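This requires looping over the data.table row by row, so it may not be very fast, but it doesn't require setting any parameters for the fields of interest.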

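library(data.table)
set.seed(7)

setDT(df)
# Shuffle the rows so the pass below visits them in random order.
dt1 <- df[, .SD[sample(.N)]]
dt1[, i := .I]      # row position after shuffling
dt1[, flag := NA]   # NA = undecided, TRUE = kept, FALSE = excluded
setkey(dt1, flag)

# Visit each row in turn. If row x is still undecided, keep it and exclude
# every other row that shares its trt/individual combination or its session.
# invisible() only suppresses the list that lapply() would otherwise print.
invisible(lapply(dt1$i, function(x) {
  dt1[is.na(flag[x]) & (trt == trt[x] & individual == individual[x] | session == session[x]), flag := i == x]
}))

dt1[flag == TRUE, ]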

   trt individual session             data  i flag
1:   C        Tim       9 3.63712332100071  1 TRUE
2:   A      Nancy       4 4.54908662150973  2 TRUE
3:   A        Tim       1 5.84217708521442  3 TRUE
4:   B        Tim       2 2.37343483362789  5 TRUE
5:   C      Nancy       3 2.87792051390258  7 TRUE
6:   A        Bob       7 3.45471592963754 12 TRUE
7:   B      Nancy       8 4.54792567807183 15 TRUE
8:   C        Bob       6 4.45667777212948 24 TRUE
9:   B        Bob       5 2.33285598638319 27 TRUE
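
For comparison, here is a loop-free sketch in base R. It is not the method used in the answer above; it swaps in a different technique and relies on assumptions that hold for the example data but not necessarily elsewhere: each session belongs to exactly one trt, every individual appears in every session, and each trt has as many sessions as individuals. Under those assumptions, a valid draw is simply a random one-to-one pairing of individuals with sessions inside each trt. The names `full`, `pairs`, and `draw` are illustrative.

```r
set.seed(7)

# Same layout as the example data, but with `session` and `data` kept numeric.
full <- data.frame(
  trt        = rep(rep(c("A", "B", "C"), each = 3), times = 3),
  individual = rep(c("Bob", "Nancy", "Tim"), times = 9),
  session    = rep(1:9, each = 3),
  data       = rnorm(27, mean = 4, sd = 1)
)

# Within each trt, pair every individual with a randomly permuted session.
pairs <- do.call(rbind, lapply(split(full, full$trt), function(g) {
  people   <- unique(g$individual)
  sessions <- unique(g$session)
  # Assumption stated above: as many sessions as individuals in each trt.
  stopifnot(length(people) == length(sessions))
  data.frame(
    trt        = g$trt[1],
    individual = people,
    session    = sessions[sample.int(length(sessions))]
  )
}))

# Pull the matching rows (with their data values) out of the full table.
draw <- merge(full, pairs, by = c("trt", "individual", "session"))
draw <- draw[order(draw$session), ]
draw
```

If the design is unbalanced, the `stopifnot()` call fails rather than returning an incomplete draw; in that case a row-by-row approach like the one in the answer is the more general choice.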