Get unique entries in an R data.table if a condition is met

Time: 2020-04-09 08:30:59

Tags: r data.table

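Question: Suppose I have the data.table object below. I only want to keep the entries that meet the following conditions:

  • For each CURRENT_DATE and IID, if there is a row with state = final_e and that IID already has a row with state = init_e on that date, keep only the final_e row for that IID.
  • For each CURRENT_DATE and IID, rows with state = e are unaffected and stay in the data.

Any suggestions on how to do this so that I get the desired object? Many thanks!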

library(data.table)

dt <- data.table(
  CURRENT_DATE = c("2020-01-01", "2020-01-01", "2020-01-01", "2020-01-02", "2020-01-02", "2020-01-02"),
  IID = c(1, 1, 2, 1, 2, 2),
  state = c("init_e", "final_e", "e", "e", "init_e", "final_e"),
  vals = c(10, 20, 30, 22, 9, 7),
  text = c("some_text1", "some_text2", "some_text3", "some_text4", "some_text5", "some_text6")
)

## Output:
   CURRENT_DATE IID   state vals       text
1:   2020-01-01   1  init_e   10 some_text1
2:   2020-01-01   1 final_e   20 some_text2
3:   2020-01-01   2       e   30 some_text3
4:   2020-01-02   1       e   22 some_text4
5:   2020-01-02   2  init_e    9 some_text5
6:   2020-01-02   2 final_e    7 some_text6

## Desired Output:
   CURRENT_DATE IID   state vals       text
1:   2020-01-01   1 final_e   20 some_text2
2:   2020-01-01   2       e   30 some_text3
3:   2020-01-02   1       e   22 some_text4
4:   2020-01-02   2 final_e    7 some_text6

Edit:

library(data.table)

dt2 <- data.table(
  CURRENT_DATE = c("2020-01-01", "2020-01-01", "2020-01-01", "2020-01-02", "2020-01-02"),
  IID = c(1, 1, 2, 1, 2),
  state = c("init_e", "final_e", "e", "e", "final_e"),
  vals = c(10, 20, 30, 22, 7),
  text = c("some_text1", "some_text2", "some_text3", "some_text4", "some_text5")
)

## Output:
   CURRENT_DATE IID   state vals       text
1:   2020-01-01   1  init_e   10 some_text1
2:   2020-01-01   1 final_e   20 some_text2
3:   2020-01-01   2       e   30 some_text3
4:   2020-01-02   1       e   22 some_text4
5:   2020-01-02   2 final_e    7 some_text5

With this data, one of the answers gives:

setorder(dt2[, rn := .I], CURRENT_DATE, IID, state)
dt2[sort(c(dt2[state=="e", which=TRUE],
          unique(dt2[state %chin% c("final_e","init_e")], by=c("CURRENT_DATE","IID"))$rn))]

## Output:
   CURRENT_DATE IID   state vals       text rn
1:   2020-01-01   1  init_e   10 some_text1  1
2:   2020-01-01   2       e   30 some_text3  3
3:   2020-01-02   1       e   22 some_text4  4
4:   2020-01-02   2 final_e    7 some_text5  5

## Desired Output:
   CURRENT_DATE IID   state vals       text
1:   2020-01-01   1 final_e   20 some_text2
2:   2020-01-01   2       e   30 some_text3
3:   2020-01-02   1       e   22 some_text4
4:   2020-01-02   2 final_e    7 some_text5

3 answers:

Answer 0 (score: 2)

Here is another option:

setkey(dt, CURRENT_DATE, IID, state)[, rn := .I]
dt[sort(c(dt[state=="e", which=TRUE],
    unique(dt[state %chin% c("final_e","init_e")], by=c("CURRENT_DATE","IID"))$rn))]

Or, relying only on the small sample dataset:

dt[state!="init_e"]
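
A note on ordering, as a sketch: the values in rn only work as row indices if they are assigned after the table has been sorted. The question's edit assigns rn before setorder, which is presumably why it returns the init_e row. Assuming the dt2 data from the question's edit, sorting first and numbering second appears to give the desired rows (keep_e, keep_first and res are names introduced here for illustration):

```r
library(data.table)

# dt2 as given in the question's edit
dt2 <- data.table(
  CURRENT_DATE = c("2020-01-01", "2020-01-01", "2020-01-01", "2020-01-02", "2020-01-02"),
  IID = c(1, 1, 2, 1, 2),
  state = c("init_e", "final_e", "e", "e", "final_e"),
  vals = c(10, 20, 30, 22, 7),
  text = c("some_text1", "some_text2", "some_text3", "some_text4", "some_text5")
)

# Sort first, then number the rows, so rn equals each row's position
# in the sorted table.
setkey(dt2, CURRENT_DATE, IID, state)
dt2[, rn := .I]

# Positions of all e rows
keep_e <- dt2[state == "e", which = TRUE]

# First init_e/final_e row per group; "final_e" sorts before "init_e"
keep_first <- unique(dt2[state %chin% c("final_e", "init_e")],
                     by = c("CURRENT_DATE", "IID"))$rn

res <- dt2[sort(c(keep_e, keep_first))]
print(res)
```

Because setkey sorts "final_e" ahead of "init_e", unique() picks the final_e row whenever a group has both.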

Answer 1 (score: 0)

We can write a custom function:

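check_condition <- function(state) {
     # With an init_e row in the group, keep only the final_e row(s); otherwise
     # keep the e rows. any() keeps each if() condition at length one.
     if (any(state == "init_e")) which(state == "final_e")
     else if (any(state == "e")) which(state == "e")
}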

and apply it to each group.

library(data.table)
dt[, .SD[check_condition(state)], .(CURRENT_DATE, IID)]

#   CURRENT_DATE IID   state vals       text
#1:   2020-01-01   1 final_e   20 some_text2
#2:   2020-01-01   2       e   30 some_text3
#3:   2020-01-02   1       e   22 some_text4
#4:   2020-01-02   2 final_e    7 some_text6
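
The function above returns nothing for a group that contains only final_e, such as IID 2 on 2020-01-02 in dt2 from the question's edit. A minimal sketch of a variant that keeps such rows, assuming the intent is to drop an init_e row only when its group also has a final_e row (the question does not say what should happen to an init_e without a matching final_e; this sketch keeps it, and res is a name introduced here):

```r
library(data.table)

# dt2 as given in the question's edit
dt2 <- data.table(
  CURRENT_DATE = c("2020-01-01", "2020-01-01", "2020-01-01", "2020-01-02", "2020-01-02"),
  IID = c(1, 1, 2, 1, 2),
  state = c("init_e", "final_e", "e", "e", "final_e"),
  vals = c(10, 20, 30, 22, 7),
  text = c("some_text1", "some_text2", "some_text3", "some_text4", "some_text5")
)

# Within each (CURRENT_DATE, IID) group, drop a row only if it is init_e
# and the same group also contains a final_e row. Everything else stays.
res <- dt2[, .SD[!(state == "init_e" & any(state == "final_e"))],
           by = .(CURRENT_DATE, IID)]
print(res)
```

This avoids sorting altogether, so the original row order within each group is preserved.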

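Answer 2 (score: 0)

Let me also answer my own question, since I found a neat (and obvious) solution:

  • Essentially, I first sort the data.table object and then take all unique elements by (CURRENT_DATE, IID)
  • The "trick" is to encode the state variable as an ordered factor
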
dt2[, state := factor(state, levels = c("final_e", "init_e", "e"),
                      ordered = TRUE)]

sorted_frame <- dt2[order(CURRENT_DATE, IID, state)]
u_frame <- unique(sorted_frame, by = c("CURRENT_DATE", "IID"))
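
As a sketch of what these steps produce, the snippet below runs them on dt2 from the question's edit. One caveat worth checking on real data: because unique() keeps a single row per (CURRENT_DATE, IID), a group that had both an e row and a final_e row would lose the e row. The mixed table and the names state_levels and u_mixed are hypothetical, added here only to illustrate that; they are not part of the original data.

```r
library(data.table)

# dt2 as given in the question's edit
dt2 <- data.table(
  CURRENT_DATE = c("2020-01-01", "2020-01-01", "2020-01-01", "2020-01-02", "2020-01-02"),
  IID = c(1, 1, 2, 1, 2),
  state = c("init_e", "final_e", "e", "e", "final_e"),
  vals = c(10, 20, 30, 22, 7),
  text = c("some_text1", "some_text2", "some_text3", "some_text4", "some_text5")
)

# final_e ranks first, so it becomes the first row of its group after sorting
state_levels <- c("final_e", "init_e", "e")
dt2[, state := factor(state, levels = state_levels, ordered = TRUE)]

u_frame <- unique(dt2[order(CURRENT_DATE, IID, state)],
                  by = c("CURRENT_DATE", "IID"))
print(u_frame)

# Hypothetical group holding both e and final_e on the same date
mixed <- data.table(
  CURRENT_DATE = c("2020-01-03", "2020-01-03"),
  IID = c(1, 1),
  state = factor(c("e", "final_e"), levels = state_levels, ordered = TRUE)
)
u_mixed <- unique(mixed[order(CURRENT_DATE, IID, state)],
                  by = c("CURRENT_DATE", "IID"))
print(u_mixed)  # only one of the two rows survives
```

If such mixed groups can occur, the e rows would need to be set aside before calling unique() and appended afterwards.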