Question

Sample data

data = data.frame(id = c(1,1,1,2,2,2,3,3,3,4,4,4),
                  score = c(5,7,6,9,8,4,NA,11,3,7,NA,10))

In this example, if any score for an id equals 7, I want to remove all rows for that id to get a new data frame, for example:

data2 = data.frame(id = c(2,2,2,3,3,3),
                   score = c(9,8,4,NA,11,3))

I tried data[data$score != 7,], but that only works on individual rows, not on the whole group.
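To see why the row-wise filter falls short, here is a small sketch on the sample data (the name `rowwise` is only illustrative and not part of the question). It drops the two rows whose score is 7 but keeps the remaining rows of ids 1 and 4, and each NA score turns into a row filled with NA.

```r
data <- data.frame(id = c(1,1,1,2,2,2,3,3,3,4,4,4),
                   score = c(5,7,6,9,8,4,NA,11,3,7,NA,10))

# Row-wise comparison: one TRUE/FALSE/NA per row, not per group
rowwise <- data[data$score != 7, ]

# ids 1 and 4 are still present, because only their score-7 rows were dropped
unique(rowwise$id)

# An NA in the logical row index produces a row filled with NA
sum(is.na(rowwise$id))
```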

Answer 1

Use subset to keep every group for which !any(x == 7, na.rm = TRUE) is TRUE. This one-liner uses only base R.

subset(data, !ave(score, id, FUN = function(x) any(x == 7, na.rm = TRUE)))

This returns the six rows for ids 2 and 3, matching the desired data2.
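A minimal sketch of what ave() contributes in this one-liner, using the sample data. The intermediate names `has7` and `kept` are introduced here only for illustration.

```r
data <- data.frame(id = c(1,1,1,2,2,2,3,3,3,4,4,4),
                   score = c(5,7,6,9,8,4,NA,11,3,7,NA,10))

# ave() applies FUN within each id and recycles the result to every row of
# that group. Because score is numeric, the logical result is stored as
# 1 (TRUE) or 0 (FALSE).
has7 <- ave(data$score, data$id, FUN = function(x) any(x == 7, na.rm = TRUE))
has7

# Negating turns 0 into TRUE, so only groups without a 7 are kept
kept <- subset(data, !has7)
kept
```

The na.rm = TRUE matters for id 3, whose scores include NA but no 7: without it, any() would return NA for that group instead of FALSE.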

Answer 2

If you want a solution that does not require any packages, you can try:

data[!(data$id %in% data$id[data$score == 7]) , ]


  id score
4  2     9
5  2     8
6  2     4
7  3    NA
8  3    11
9  3     3

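To explain a bit: data$id[data$score == 7] finds the ids that have a score of 7. Then data$id %in% data$id[data$score == 7] gives a logical vector that is TRUE wherever the id in the original data frame is one of those ids. Finally, we wrap it in ! to drop those ids.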
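The three steps above can be sketched one at a time. The names `ids_with_7`, `drop_row`, and `result` are hypothetical, and which() is a small variation on the one-liner: with NA scores, data$score == 7 is NA for those rows, so the inner subset also yields NA entries. That is harmless here because no id is NA, but which() avoids them altogether.

```r
data <- data.frame(id = c(1,1,1,2,2,2,3,3,3,4,4,4),
                   score = c(5,7,6,9,8,4,NA,11,3,7,NA,10))

# Step 1: ids that have at least one score of 7 (which() skips NA comparisons)
ids_with_7 <- unique(data$id[which(data$score == 7)])

# Step 2: logical vector, TRUE for every row belonging to one of those ids
drop_row <- data$id %in% ids_with_7

# Step 3: negate to keep only the other groups
result <- data[!drop_row, ]
result
```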

This is probably overkill here, but we can benchmark all the options proposed so far:

library(dplyr)
library(microbenchmark)

microbenchmark(
  `G. Grothendieck` = subset(data, !ave(score, id, FUN = function(x) any(x == 7, na.rm = TRUE))),
  `Nick Criswell` = data[!(data$id %in% data$id[data$score == 7]), ],
  divibisan = data %>%
    group_by(id) %>%
    filter(!(7 %in% score)),
  arg0naut = data %>%
    anti_join(data %>% filter(score == 7), by = "id"),
  tmfmnk = data %>%
    group_by(id) %>%
    filter(!any(score == 7, na.rm = TRUE)),
  `d.b` = data[!data$id %in% split(data$id, data$score)$`7`, ]
)


     Unit: microseconds
            expr     min      lq     mean   median       uq       max neval
 G. Grothendieck 160.001 177.455 189.4648 185.4545 195.6370   269.576   100
   Nick Criswell  37.819  45.091  52.2820  53.8190  57.2130    93.576   100
       divibisan 443.636 456.000 480.1211 464.0000 489.4545   904.726   100
        arg0naut 733.091 757.818 806.7143 766.0600 805.3325  1543.755   100
          tmfmnk 444.121 457.939 704.8916 463.0300 479.5150 22332.079   100
             d.b 103.759 114.424 125.3291 122.1825 131.8800   202.182   100
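The timings above come from a 12-row data frame, so they may mostly reflect fixed overhead rather than how each approach scales. As a rough sketch, assuming you want to repeat the comparison on more data, the following builds a larger data frame of the same shape and checks that two of the base R approaches agree on it. The object `big`, the group count, and the seed are arbitrary choices made for this illustration.

```r
set.seed(1)  # arbitrary seed, only for reproducibility
n_groups <- 1000
big <- data.frame(id = rep(seq_len(n_groups), each = 3),
                  score = sample(c(1:11, NA), 3 * n_groups, replace = TRUE))

# Two of the base R approaches from the answers, applied to the larger data
res_ave <- subset(big, !ave(score, id, FUN = function(x) any(x == 7, na.rm = TRUE)))
res_in  <- big[!(big$id %in% big$id[big$score == 7]), ]

# Both should select the same rows
isTRUE(all.equal(res_ave, res_in))
```

The same microbenchmark() call can then be run with `big` in place of `data`.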

Answer 3

data[!data$id %in% split(data$id, data$score)$`7`,]
#  id score
#4  2     9
#5  2     8
#6  2     4
#7  3    NA
#8  3    11
#9  3     3
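A short sketch of what the split() call produces, with one caveat worth knowing. split(data$id, data$score) groups the ids by score value (NA scores are dropped), and $`7` picks the ids whose score is 7. Because $ matches list names partially, a data set with no score of 7 but a score such as 70 could pick the wrong element; [["7"]] matches exactly. The names `by_score` and `partial` are only for illustration.

```r
data <- data.frame(id = c(1,1,1,2,2,2,3,3,3,4,4,4),
                   score = c(5,7,6,9,8,4,NA,11,3,7,NA,10))

# One list element per distinct score, each holding the ids with that score
by_score <- split(data$id, data$score)
names(by_score)
by_score$`7`

# Caveat: $ matches list names partially, [[ ]] matches exactly
partial <- split(c(1, 2), c(70, 5))
partial$`7`      # partially matches the element named "70"
partial[["7"]]   # exact match only, so this is NULL

# Same filter as the answer, with exact matching
data[!data$id %in% by_score[["7"]], ]
```

When no element matches, the lookup returns NULL, and x %in% NULL is FALSE everywhere, so all rows are kept.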

Answer 4

We can do:

library(dplyr)

data %>%
  anti_join(data %>% filter(score == 7), by = "id")

This returns the six rows for ids 2 and 3.
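The anti-join can also be read in two steps, sketched below with the sample data. The intermediate table `ids_to_drop` is a name introduced here for illustration, and distinct() is optional, since anti_join() only filters rows of the first table and never duplicates them.

```r
library(dplyr)

data <- data.frame(id = c(1,1,1,2,2,2,3,3,3,4,4,4),
                   score = c(5,7,6,9,8,4,NA,11,3,7,NA,10))

# Step 1: lookup table of ids that have a score of 7
# (filter() drops rows where the comparison is NA)
ids_to_drop <- data %>%
  filter(score == 7) %>%
  distinct(id)

# Step 2: keep only the rows of data whose id has no match in the lookup table
result <- anti_join(data, ids_to_drop, by = "id")
result
```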

Answer 5

In dplyr, we can group_by id and then filter each group on whether 7 is not (!) present in the score variable:

library(dplyr)
data %>%
    group_by(id) %>%
    filter(!(7 %in% score))

# A tibble: 6 x 2
# Groups:   id [2]
     id score
  <dbl> <dbl>
1     2     9
2     2     8
3     2     4
4     3    NA
5     3    11
6     3     3
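One reason 7 %in% score works without any NA handling is that %in% never returns NA, whereas == does. A small base R sketch using the scores of id 3 (the variable names are illustrative):

```r
score_g3 <- c(NA, 11, 3)  # scores of id 3 in the sample data

in_res   <- 7 %in% score_g3                   # FALSE: %in% treats NA as "no match"
any_res  <- any(score_g3 == 7)                # NA: the comparison with NA is unknown
any_narm <- any(score_g3 == 7, na.rm = TRUE)  # FALSE once NA is removed

c(in_res, any_res, any_narm)
```

Since filter() keeps only rows where the condition is TRUE, a group whose condition evaluates to NA would be dropped, which is why the any() variant needs na.rm = TRUE.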

Answer 6

Another dplyr possibility:

data %>%
 group_by(id) %>%
 filter(!any(score == 7, na.rm = TRUE))

     id score
  <dbl> <dbl>
1     2     9
2     2     8
3     2     4
4     3    NA
5     3    11
6     3     3

Or:

data %>%
 group_by(id) %>%
 filter(!any(cumsum(ifelse(is.na(score), 0, score) == 7) >= 1))

Or the same with base R:

data[!ave(data$score, data$id, FUN = function(x) any(cumsum(ifelse(is.na(x), 0, x) == 7) >= 1)), ]

  id score
4  2     9
5  2     8
6  2     4
7  3    NA
8  3    11
9  3     3

Or a possibility similar to the one from @G. Grothendieck, but without subset():

data[!ave(data$score, data$id, FUN = function(x) any(x == 7, na.rm = TRUE)), ]

Remove a group if any row meets a condition
