Grouping and summarising data with a countif-style calculation

Time: 2019-06-13 14:07:43

Tags: r dplyr

Imagine the following table, named DT

ID    Path   Status
AA    XXX    Completed
AB    XXX    Completed
AC    XXX    In progress
AD    XYY    Completed
AE    XYY    In progress

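I want to group this table by Path and calculate (1) the number of unique IDs and (2) the number of unique IDs whose Status is "Completed" (there are no duplicate IDs in the original table DT).

I tried the following code: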

DT_Grouped <- DT %>%
     group_by(Path) %>%
     summarise(CountComplete = sum(DT$Status == "Completed"), Count=n())

This produces the following result:

Path   CountComplete   Count
XXX    3               3
XYY    3               2

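CountComplete always gives the total number of unique IDs with Status "Completed" across the whole table; it is not grouped by Path. That makes sense, because the calculation refers to the original table rather than the grouped dataset.

How should I modify the code so that CountComplete is calculated per Path group?

Thanks in advance for your help.

1 Answer:

Answer 0: (score: 1)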

The reason is that DT$Status pulls the entire Status column from the full dataset, rather than the Status values within each group

sum(DT$Status == "Completed")
     ^^^^

It should be

library(dplyr)
DT_Grouped <- DT %>%
     group_by(Path) %>%
     summarise(CountComplete = sum(Status == "Completed"), Count=n())

DT_Grouped
# A tibble: 2 x 3
#   Path  CountComplete Count
#   <chr>         <int> <int>
#1 XXX               2     3
#2 XYY               1     2
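
Since the question asks for counts of unique IDs, a variant using n_distinct() may be worth a look in case IDs could ever repeat within a Path. This is only a sketch under that assumption; the name DT_Distinct is made up for illustration, and with the sample data (no duplicate IDs) it should give the same counts as the code above.

```r
library(dplyr)

# Sample data, same values as in the "Data" section of this answer
DT <- data.frame(
  ID = c("AA", "AB", "AC", "AD", "AE"),
  Path = c("XXX", "XXX", "XXX", "XYY", "XYY"),
  Status = c("Completed", "Completed", "In progress", "Completed", "In progress"),
  stringsAsFactors = FALSE
)

# DT_Distinct is a hypothetical name used only for this example
DT_Distinct <- DT %>%
  group_by(Path) %>%
  summarise(
    # distinct IDs among the rows of this group whose Status is "Completed"
    CountComplete = n_distinct(ID[Status == "Completed"]),
    # distinct IDs in this group
    Count = n_distinct(ID)
  )

DT_Distinct
```

Because Status and ID are referenced without the DT$ prefix, both are evaluated within each Path group.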

If it is a data.table, the corresponding approach would be

library(data.table)
setDT(DT)[, .(CountComplete = sum(Status == "Completed"), Count = .N), by = Path]

Data

DT <- structure(list(ID = c("AA", "AB", "AC", "AD", "AE"), Path = c("XXX", 
"XXX", "XXX", "XYY", "XYY"), Status = c("Completed", "Completed", 
"In progress", "Completed", "In progress")),
class = "data.frame", row.names = c(NA, 
-5L))
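
As a rough cross-check of the explanation above, the difference between the ungrouped and grouped sums can be reproduced in base R without any packages. This is a sketch, not part of the original answer; the variable names total_completed, completed_by_path and count_by_path are chosen here for illustration.

```r
# Sample data, same values as the structure() call above
DT <- data.frame(
  ID = c("AA", "AB", "AC", "AD", "AE"),
  Path = c("XXX", "XXX", "XXX", "XYY", "XYY"),
  Status = c("Completed", "Completed", "In progress", "Completed", "In progress"),
  stringsAsFactors = FALSE
)

# Ungrouped: DT$Status is the whole column, so the sum covers every row.
# This is the value that was repeated for each Path in the question.
total_completed <- sum(DT$Status == "Completed")

# Grouped: split the logical vector by Path before summing
completed_by_path <- tapply(DT$Status == "Completed", DT$Path, sum)

# Number of rows (IDs) per Path
count_by_path <- tapply(DT$ID, DT$Path, length)

total_completed
completed_by_path
count_by_path
```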