Combine a table of aggregated values with summary variables from the "parent" dataset

Date: 2012-12-17 14:43:02

Tags: r

I have a dataset like this:

df<-data.frame(sp=c(100, 100, 100, 101, 101, 101, 102, 102, 102),
type=c("C","C","C","H","H","H","C","C","C"),
country=c("A", "A", "A", "B", "B", "B", "C", "C", "C"),
vals=c(1,2,3,4,5,6,7,8,9)
)

I want to aggregate df$vals and carry the other variables along with the result.

Currently I do it like this:

multi.func<- function(x){
c(
n = length(x),
min = min(x, na.rm=TRUE),
max = max(x, na.rm=TRUE),
mean = mean(x, na.rm=TRUE)
)}

aggVals<-as.data.frame(do.call(rbind, by(df$vals, df$sp, FUN=multi.func, simplify=TRUE)))
aggVals$sp<-row.names(aggVals)

aggDescrip<-aggregate(cbind(as.character(type), as.character(country)) ~ sp, data=df, FUN=unique)

result<-merge(aggDescrip,aggVals)

This works fine, but I wonder whether there is a simpler way.

Thanks

2 answers:

Answer 0: (score: 3)

Perhaps you should look at the data.table package.

library(data.table)
DT <- data.table(df, key="sp")
DT[, list(type = unique(as.character(type)), 
          country = unique(as.character(country)), 
          n = .N, min = min(vals), max = max(vals), 
          mean = mean(vals)), by=key(DT)]
#     sp type country n min max mean
# 1: 100    C       A 3   1   3    2
# 2: 101    H       B 3   4   6    5
# 3: 102    C       C 3   7   9    8

If you would rather stick with base R, here is another approach that may be useful (though aggregate is probably more common):

unique(within(df, {
    mean <- ave(vals, sp, FUN=mean)
    max <- ave(vals, sp, FUN=max)
    min <- ave(vals, sp, FUN=min)
    n <- ave(vals, sp, FUN=length)
    rm(vals)
}))
#    sp type country n min max mean
# 1 100    C       A 3   1   3    2
# 4 101    H       B 3   4   6    5
# 7 102    C       C 3   7   9    8
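The reason unique() is needed above is that ave() returns a vector as long as its input, with each group's summary repeated on every row of that group. A minimal sketch of that behaviour, using the same values as df (the names group_mean and group_n are only for illustration):

```r
# Same values and grouping variable as in df
vals <- c(1, 2, 3, 4, 5, 6, 7, 8, 9)
sp   <- c(100, 100, 100, 101, 101, 101, 102, 102, 102)

# ave() applies FUN within each group and recycles the result
# back onto every row of that group
group_mean <- ave(vals, sp, FUN = mean)
group_n    <- ave(vals, sp, FUN = length)

group_mean
# [1] 2 2 2 5 5 5 8 8 8
group_n
# [1] 3 3 3 3 3 3 3 3 3
```

Once vals is removed, the three rows within each group are identical, so unique() collapses them to one row per sp.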

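Update: a variation on your initial attempt

I would suggest sticking with data.table if possible, since the resulting code is easy to follow and the aggregation is fast.

However, with a few modifications you can make your own base R approach more direct.

First, modify your function so that it uses data.frame instead of c(). Also, add an argument that specifies which column needs to be aggregated.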

multi.func <- function(x, value_column) {
    data.frame(
        n = length(x[[value_column]]),
        min = min(x[[value_column]], na.rm=TRUE),
        max = max(x[[value_column]], na.rm=TRUE),
        mean = mean(x[[value_column]], na.rm=TRUE))
}

Second, use lapply on the dataset split by the grouping variable, merge the output with the original dataset, and return the unique rows.

unique(merge(df[-4], 
             do.call(rbind, lapply(split(df, df$sp), 
                                   multi.func, value_column = "vals")),
             by.x = "sp", by.y = "row.names"))
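The same steps can also be written out one at a time, which makes the intermediate objects easier to inspect. This is only a sketch of a variation, not the code above: it de-duplicates the descriptive columns before merging and adds an explicit sp column instead of merging on row names. The names pieces, stats and keys are introduced here for illustration.

```r
df <- data.frame(sp = c(100, 100, 100, 101, 101, 101, 102, 102, 102),
                 type = c("C", "C", "C", "H", "H", "H", "C", "C", "C"),
                 country = c("A", "A", "A", "B", "B", "B", "C", "C", "C"),
                 vals = c(1, 2, 3, 4, 5, 6, 7, 8, 9))

# One-row data.frame of summaries for a chosen column
multi.func <- function(x, value_column) {
    v <- x[[value_column]]
    data.frame(n = length(v),
               min = min(v, na.rm = TRUE),
               max = max(v, na.rm = TRUE),
               mean = mean(v, na.rm = TRUE))
}

# Split by group, summarise each piece, then stack the pieces;
# the row names of stats are the group labels "100", "101", "102"
pieces <- lapply(split(df, df$sp), multi.func, value_column = "vals")
stats <- do.call(rbind, pieces)
stats$sp <- as.numeric(rownames(stats))

# One row of descriptive variables per group, then join on sp
keys <- unique(df[c("sp", "type", "country")])
result <- merge(keys, stats, by = "sp")
result
```

De-duplicating first keeps the merge small, which may matter on larger data; the sketch assumes type and country are constant within each sp, as they are in the example data.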

Answer 1: (score: 2)

Using only aggregate:

result <- aggregate(vals ~ type + sp + country, df, 
    function(x) c(length(x), min(x), max(x), mean(x))
)

result
  type  sp country vals.1 vals.2 vals.3 vals.4
1    C 100       A      3      1      3      2
2    H 101       B      3      4      6      5
3    C 102       C      3      7      9      8

colnames(result)
[1] "type"    "sp"      "country" "vals"  

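The aggregate call appears to create an odd "multi-valued" column. summaryBy from the doBy package is similar to aggregate, but allows output with multiple columns: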

library(doBy)
result <- summaryBy(vals ~ type + sp + country, df, 
    FUN=function(x) c(n=length(x), min=min(x), max=max(x), mean=mean(x))
)

result
  type  sp country vals.n vals.min vals.max vals.mean
1    C 100       A      3        1        3         2
2    C 102       C      3        7        9         8
3    H 101       B      3        4        6         5

colnames(result)
[1] "type"      "sp"        "country"   "vals.n"    "vals.min"  "vals.max" 
[7] "vals.mean"
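As a base R footnote to the aggregate call in this answer: the single vals entry in its column names is a matrix stored inside the data frame. A minimal sketch of one common way to flatten it, passing the result through do.call(data.frame, ...); the name flat and the names given to the summary vector are additions for illustration:

```r
df <- data.frame(sp = c(100, 100, 100, 101, 101, 101, 102, 102, 102),
                 type = c("C", "C", "C", "H", "H", "H", "C", "C", "C"),
                 country = c("A", "A", "A", "B", "B", "B", "C", "C", "C"),
                 vals = c(1, 2, 3, 4, 5, 6, 7, 8, 9))

# Naming the elements of the returned vector gives the matrix column names
result <- aggregate(vals ~ type + sp + country, df,
    function(x) c(n = length(x), min = min(x), max = max(x), mean = mean(x))
)

is.matrix(result$vals)   # the "multi-valued" column is a matrix
ncol(result)             # counted as one column of the data frame

# data.frame() expands a matrix argument into one column per matrix column
flat <- do.call(data.frame, result)
colnames(flat)
```

After flattening, the summaries are ordinary columns named vals.n, vals.min, vals.max and vals.mean, so no extra package is needed for this particular case.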