How do I summarise a data frame by column and row?

Time: 2018-11-09 11:44:01

Tags: r

I have the following dataset:

Class   Total   AC  Final_Coverage
A   1000        1   55
A   1000        2   66
B   1000        1   77
A   1000        3   88
B   1000        2   99
C   1000        1   11
B   1000        3   12
B   1000        4   13
B   1000        5   22
C   1000        2   33
C   1000        3   44
C   1000        4   55
C   1000        5   102
A   1000        4   105
A   1000        5   109

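I want to get the mean of AC and Final_Coverage over the first three rows of each class. I then want to store those means, together with the class name, in a new data frame. To do that, I did the following: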

dataset <- read_csv("/home/ad/Desktop/testt.csv")

classes <- unique(dataset$Class)
new_data <- data.frame(Class = character(0), AC = numeric(0), Coverage = numeric(0))

for(class in classes){
  new_data$Class <- class
  dataClass <- subset(dataset, Class == class)

  tenRows <- dataClass[1:3,]

  coverageMean <- mean(tenRows$Final_Coverage)
  acMean <- mean(tenRows$AC)

  new_data$Coverage <- coverageMean
  new_data$AC <- acMean
}

Everything works except writing the means into the new_data data frame. I get the following error:

Error in `$<-.data.frame`(`*tmp*`, "Class", value = "A") : 
  replacement has 1 row, data has 0

Do you know how to fix this?

2 Answers:

Answer 0: (score: 2)

This should give you the new data frame using dplyr.

library(dplyr)

dataset %>% group_by(Class) %>% slice(1:3) %>%
  summarise(AC = mean(AC), Coverage = mean(Final_Coverage))
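As a self-contained variant of the pipeline above, here is a minimal sketch that types the question's values into a data frame (the `Total` column is left out because it is not used) and uses `slice_head(n = 3)`, which assumes dplyr 1.0.0 or later. The name `result` is only illustrative.

```r
library(dplyr)

# Same values as the question's data, typed in so the example runs on its own
dataset <- data.frame(
  Class = c("A", "A", "B", "A", "B", "C", "B", "B", "B", "C", "C", "C", "C", "A", "A"),
  AC = c(1, 2, 1, 3, 2, 1, 3, 4, 5, 2, 3, 4, 5, 4, 5),
  Final_Coverage = c(55, 66, 77, 88, 99, 11, 12, 13, 22, 33, 44, 55, 102, 105, 109)
)

result <- dataset %>%
  group_by(Class) %>%
  slice_head(n = 3) %>%   # first three rows of each class, in their original order
  summarise(AC = mean(AC), Coverage = mean(Final_Coverage))

print(result)
```

The result has one row per class, with the two means in the `AC` and `Coverage` columns.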

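In your approach, the problem is that you created the new data frame with 0 rows and then tried to assign a single value to one of its columns. The error says exactly that: the replacement has 1 row, but the data frame has 0 rows. This, however, will work: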

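# Pre-allocate one row per class so that there is a slot to assign into
new_data <- data.frame(Class = classes, AC = NA, Coverage = NA)

for(class in classes){
 # Class is already filled in above; assigning new_data$Class <- class here
 # would overwrite the whole column with the current class
 dataClass <- subset(dataset, Class == class)

 firstRows <- dataClass[1:3,]

 coverageMean <- mean(firstRows$Final_Coverage)
 acMean <- mean(firstRows$AC)

 new_data$Coverage[classes == class] <- coverageMean
 new_data$AC[classes == class] <- acMean
}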
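To make the cause of the error concrete, the sketch below first reproduces it on an empty data frame, then builds the result one row per class and binds the rows together, so nothing is ever assigned into a 0-row data frame. This is an alternative to pre-allocating, not the method used in the answer. The names `empty`, `msg`, and `rows` are illustrative, and the data is typed in from the question without the unused `Total` column.

```r
# Reproduce the error: assigning a length-1 value to a column of a 0-row data frame
empty <- data.frame(Class = character(0), AC = numeric(0), Coverage = numeric(0))
msg <- tryCatch({
  empty$Class <- "A"
  "no error"
}, error = function(e) conditionMessage(e))
print(msg)

# Same values as the question's data
dataset <- data.frame(
  Class = c("A", "A", "B", "A", "B", "C", "B", "B", "B", "C", "C", "C", "C", "A", "A"),
  AC = c(1, 2, 1, 3, 2, 1, 3, 4, 5, 2, 3, 4, 5, 4, 5),
  Final_Coverage = c(55, 66, 77, 88, 99, 11, 12, 13, 22, 33, 44, 55, 102, 105, 109)
)

# Build a one-row data frame per class, then stack them
rows <- lapply(unique(dataset$Class), function(cl) {
  # head() returns fewer rows if a class has fewer than three,
  # whereas dataClass[1:3, ] would pad with NA rows
  firstRows <- head(dataset[dataset$Class == cl, ], 3)
  data.frame(Class = cl,
             AC = mean(firstRows$AC),
             Coverage = mean(firstRows$Final_Coverage))
})
new_data <- do.call(rbind, rows)
print(new_data)
```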

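Answer 1: (score: 1)

You can look at aggregate(). The call below keeps the rows with AC <= 3, which in this data are exactly the first three rows of each class: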

> aggregate(df1[df1$AC <= 3, 3:4], by=list(Class=df1[df1$AC <= 3, 1]), FUN=mean)
  Class AC Final_Coverage
1     A  2       69.66667
2     B  2       62.66667
3     C  2       29.33333

Data

df1 <- structure(list(
  Class = structure(c(1L, 1L, 2L, 1L, 2L, 3L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 1L, 1L),
                    .Label = c("A", "B", "C"), class = "factor"),
  Total = c(1000L, 1000L, 1000L, 1000L, 1000L, 1000L, 1000L, 1000L,
            1000L, 1000L, 1000L, 1000L, 1000L, 1000L, 1000L),
  AC = c(1L, 2L, 1L, 3L, 2L, 1L, 3L, 4L, 5L, 2L, 3L, 4L, 5L, 4L, 5L),
  Final_Coverage = c(55L, 66L, 77L, 88L, 99L, 11L, 12L, 13L, 22L, 33L,
                     44L, 55L, 102L, 105L, 109L)),
  class = "data.frame", row.names = c(NA, -15L))
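The `AC <= 3` filter happens to pick the first three rows of each class in this data. If that cannot be relied on, a base R sketch along the following lines selects rows by their position within each class instead. The names `pos`, `first3`, and `res` are illustrative, and the data is typed in from the question without the unused `Total` column.

```r
# Same values as the question's data
df <- data.frame(
  Class = c("A", "A", "B", "A", "B", "C", "B", "B", "B", "C", "C", "C", "C", "A", "A"),
  AC = c(1, 2, 1, 3, 2, 1, 3, 4, 5, 2, 3, 4, 5, 4, 5),
  Final_Coverage = c(55, 66, 77, 88, 99, 11, 12, 13, 22, 33, 44, 55, 102, 105, 109)
)

# Position of each row within its class (1, 2, 3, ... in original order)
pos <- ave(seq_len(nrow(df)), df$Class, FUN = seq_along)

# Keep the first three rows per class, regardless of the values in AC
first3 <- df[pos <= 3, ]

# Mean of both columns per class
res <- aggregate(cbind(AC, Final_Coverage) ~ Class, data = first3, FUN = mean)
print(res)
```

On the question's data this gives the same per-class means as the `AC <= 3` version.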