Matching values across rows while summarising

Time: 2018-07-11 11:06:56

Tags: r data.table match

How can I match values across rows while summarising?

I have this data:

library(data.table)
dat<-data.table(group=rep(1,7),code=c("A11",rep("A12",3),"A10","A9","A8"),
               in.out=c(rep("In",4),rep("Out",3)),type=c("car","train","car",rep("train",3),"car"))

  group code in.out  type
     1  A11     In   car
     1  A12     In train
     1  A12     In   car
     1  A12     In train
     1  A10    Out train
     1   A9    Out train
     1   A8    Out   car

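For each observation, I want to match the type of the rows with in.out == 'Out' against the type of the rows with in.out == 'In' at each code level.

For example, for the observation with code A8, the type (car) matches the type at code A11. For code A10, on the other hand, the type (train) does not match A11. Ideally, I need to create a list of match flags (0, 1), like this: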

group code in.out  type  match
     1  A11     In   car
     1  A12     In train
     1  A12     In   car
     1  A12     In train
     1  A10    Out train  0,1
     1   A9    Out train  0,1
     1   A8    Out   car  1,1

I have been trying something like:

dat[ , match := +(type[in.out=='Out'] %in% type[in.out=='In']),by=.(code)]

But the result is not quite right. What am I missing?

1 answer:

Answer 0: (score: 0)

The OP asked how to match values across rows while summarising.
The general answer is: by joining and subsequently aggregating.
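As a minimal sketch of that pattern (not part of the original answer): with the question's data, every code occurs with only one in.out value, so grouping by code never puts "In" and "Out" rows side by side. Joining the "Out" rows to the "In" rows and aggregating per joined row with by = .EACHI avoids this. The names sides_per_code, n_match, n_sides and n_in are made up for illustration.

```r
library(data.table)

# Data from the question
dat <- data.table(
  group  = rep(1, 7),
  code   = c("A11", rep("A12", 3), "A10", "A9", "A8"),
  in.out = c(rep("In", 4), rep("Out", 3)),
  type   = c("car", "train", "car", rep("train", 3), "car")
)

# Each code appears with a single in.out value, so by = code never sees both sides
sides_per_code <- dat[, .(n_sides = uniqueN(in.out)), by = code]

# Join-then-aggregate: for every "Out" row, count the "In" rows
# that share its group and type
dt_in  <- dat[in.out == "In"]
dt_out <- dat[in.out == "Out"]
n_match <- dt_in[dt_out, on = .(group, type), by = .EACHI, .(n_in = .N)]
```

Here n_match has one row per "Out" row, in the order of dt_out.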

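If I understand correctly, the OP wants to find *all* matches between the "Out" rows and the "In" rows that have the same type. The code levels of the "In" rows are then numbered contiguously, and for each level it is checked whether a match was found.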

# create numeric observation levels
dat[, lvl := as.integer(stringr::str_replace(code, "A", ""))]
# order rows for convenience (not required but helps to understand)
setorder(dat, group, lvl)
# store "Out" rows 
dt_out <- dat[in.out == "Out"]
# store "In" rows in separate data.table and number levels contiguously
dt_in <- dat[in.out == "In"][, lvl.rank := frank(lvl, ties.method = "dense"), by = group]
   group code in.out  type lvl lvl.rank
1:     1  A11     In   car  11        1
2:     1  A12     In train  12        2
3:     1  A12     In   car  12        2
4:     1  A12     In train  12        2
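The two preparation steps can be tried on a plain vector. This is only an illustrative sketch: it uses base R's sub() instead of stringr (an assumption made to avoid the extra dependency), and the code "A15" is a hypothetical value added to show what "contiguous" means.

```r
library(data.table)

# "In" codes from the question plus one hypothetical code, "A15"
code <- c("A11", "A12", "A12", "A12", "A15")

# Strip the literal prefix "A" and convert the rest to integer
lvl <- as.integer(sub("A", "", code, fixed = TRUE))

# Dense ranking numbers the distinct levels 1, 2, 3, ... without gaps
rank_dense <- frank(lvl, ties.method = "dense")

# For comparison: "min" ranking leaves a gap after ties
rank_min <- frank(lvl, ties.method = "min")
```

With dense ranking, level 15 gets rank 3 rather than 5, which is what allows the ranks to be compared against positions 1, 2, ... later on.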

Now we can join the two subsets and aggregate while joining:

tmp <- dt_in[dt_out, on = .(group, type), by = .EACHI, 
             toString(as.integer(sort(lvl.rank) == seq_len(.N)))]
   group  type   V1
1:     1   car 1, 1
2:     1 train 0, 1
3:     1 train 0, 1

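V1 contains flags indicating whether a match was found at the first "In" level, the second "In" level, and so on. The result is used to update dt_out: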

dt_out[, match := tmp$V1][]
   group code in.out  type lvl match
1:     1   A8    Out   car   8  1, 1
2:     1   A9    Out train   9  0, 1
3:     1  A10    Out train  10  0, 1

Finally, as requested, the result is joined back to the full dataset dat:

dt_out[dat, on = .(group, code, in.out, type, lvl)]
   group code in.out  type lvl match
1:     1   A8    Out   car   8  1, 1
2:     1   A9    Out train   9  0, 1
3:     1  A10    Out train  10  0, 1
4:     1  A11     In   car  11  <NA>
5:     1  A12     In train  12  <NA>
6:     1  A12     In   car  12  <NA>
7:     1  A12     In train  12  <NA>
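The flag expression sort(lvl.rank) == seq_len(.N) reproduces the requested output for this data, but it yields one flag per matched row rather than one per level. For example, a type matched by a single "In" row at level 2 would give "0" instead of "0, 1". One possible variant, not taken from the original answer, tests each level for membership instead. The sketch assumes a single group; n_lvl and flags are hypothetical names.

```r
library(data.table)

dat <- data.table(
  group  = rep(1, 7),
  code   = c("A11", rep("A12", 3), "A10", "A9", "A8"),
  in.out = c(rep("In", 4), rep("Out", 3)),
  type   = c("car", "train", "car", rep("train", 3), "car")
)
dat[, lvl := as.integer(sub("A", "", code, fixed = TRUE))]

dt_out <- dat[in.out == "Out"]
dt_in  <- dat[in.out == "In"][, lvl.rank := frank(lvl, ties.method = "dense"), by = group]

# Number of distinct "In" levels (assumption: only one group is present)
n_lvl <- max(dt_in$lvl.rank)

# For each "Out" row, flag every "In" level that has a row of the same type
flags <- dt_in[dt_out, on = .(group, type), by = .EACHI,
               toString(as.integer(seq_len(n_lvl) %in% lvl.rank))]
```

The rows of flags follow the order of dt_out, which here is A10, A9, A8 because dat was not reordered.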

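There is also a shortcut version that only returns the matching "In" levels without creating flags. Perhaps this helps to better understand the logic: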

dt_in <- dat[in.out == "In"]
dt_out <- dat[in.out == "Out"]
dt_out[, matches := dt_in[dt_out, on = .(group, type), by = .EACHI, toString(x.code)]$V1]
dt_out[dat, on = .(group, code, in.out, type)]

   group code in.out  type  matches
1:     1  A11     In   car     <NA>
2:     1  A12     In train     <NA>
3:     1  A12     In   car     <NA>
4:     1  A12     In train     <NA>
5:     1  A10    Out train A12, A12
6:     1   A9    Out train A12, A12
7:     1   A8    Out   car A11, A12
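If the final join back to dat is not wanted, the same result could probably be written straight into dat by assigning only to the "Out" rows. This is a sketch under that assumption, not the answer's method; out_idx and matched are hypothetical names.

```r
library(data.table)

dat <- data.table(
  group  = rep(1, 7),
  code   = c("A11", rep("A12", 3), "A10", "A9", "A8"),
  in.out = c(rep("In", 4), rep("Out", 3)),
  type   = c("car", "train", "car", rep("train", 3), "car")
)

dt_in   <- dat[in.out == "In"]
out_idx <- which(dat$in.out == "Out")   # row numbers of the "Out" rows

# One string of matching "In" codes per "Out" row, in the order of out_idx
matched <- dt_in[dat[out_idx], on = .(group, type), by = .EACHI, toString(x.code)]$V1

# Assign by reference to the "Out" rows only; the "In" rows are left as NA
dat[out_idx, matches := matched]
```

The row order of dat is unchanged, so no separate join or reordering step is needed afterwards.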