I have a dataset like this:
df <- data.frame(sp = c(100, 100, 100, 101, 101, 101, 102, 102, 102),
                 type = c("C", "C", "C", "H", "H", "H", "C", "C", "C"),
                 country = c("A", "A", "A", "B", "B", "B", "C", "C", "C"),
                 vals = c(1, 2, 3, 4, 5, 6, 7, 8, 9)
)
I want to aggregate df$vals and carry the other variables along.
Currently I do it like this:
multi.func <- function(x) {
  c(
    n = length(x),
    min = min(x, na.rm = TRUE),
    max = max(x, na.rm = TRUE),
    mean = mean(x, na.rm = TRUE)
  )
}
aggVals<-as.data.frame(do.call(rbind, by(df$vals, df$sp, FUN=multi.func, simplify=TRUE)))
aggVals$sp<-row.names(aggVals)
aggDescrip<-aggregate(cbind(as.character(type), as.character(country)) ~ sp, data=df, FUN=unique)
result<-merge(aggDescrip,aggVals)
This works fine, but I am wondering whether there is a simpler way.
Thanks
Answer 0: (score: 3)
Perhaps you should take a look at the data.table package.
library(data.table)
DT <- data.table(df, key="sp")
DT[, list(type = unique(as.character(type)),
country = unique(as.character(country)),
n = .N, min = min(vals), max = max(vals),
mean = mean(vals)), by=key(DT)]
# sp type country n min max mean
# 1: 100 C A 3 1 3 2
# 2: 101 H B 3 4 6 5
# 3: 102 C C 3 7 9 8
If you want to stick with base R, here is another approach that may be useful (although aggregate is probably more common):
unique(within(df, {
mean <- ave(vals, sp, FUN=mean)
max <- ave(vals, sp, FUN=max)
min <- ave(vals, sp, FUN=min)
n <- ave(vals, sp, FUN=length)
rm(vals)
}))
# sp type country n min max mean
# 1 100 C A 3 1 3 2
# 4 101 H B 3 4 6 5
# 7 102 C C 3 7 9 8
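The call to unique in the code above is needed because ave returns one value per row rather than one per group, so each group's summary is repeated on every row of that group. A minimal sketch of that behaviour, using a cut-down copy of df (the name group_mean is only illustrative):

```r
# Cut-down copy of the question's data: grouping column and values only
df <- data.frame(sp = c(100, 100, 100, 101, 101, 101, 102, 102, 102),
                 vals = c(1, 2, 3, 4, 5, 6, 7, 8, 9))

# ave() applies FUN within each level of sp and returns a vector
# that has the same length as the input
group_mean <- ave(df$vals, df$sp, FUN = mean)
group_mean
# [1] 2 2 2 5 5 5 8 8 8

# One value per row, not one per group
length(group_mean) == nrow(df)
# [1] TRUE
```

Once vals is dropped, the three rows of each group are identical, which is why unique collapses them to one row per sp.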
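If possible, I would recommend sticking with data.table, because the resulting code is easy to follow and the aggregation is fast.
However, with a few modifications, your (other) base R approach can be made more direct.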
First, modify your function so that it uses data.frame instead of c(). Also, add an argument that specifies which column needs to be aggregated.
multi.func <- function(x, value_column) {
  data.frame(
    n = length(x[[value_column]]),
    min = min(x[[value_column]], na.rm = TRUE),
    max = max(x[[value_column]], na.rm = TRUE),
    mean = mean(x[[value_column]], na.rm = TRUE))
}
Second, use lapply on your dataset split by the grouping variable, merge the output with the original dataset, and return the unique values.
unique(merge(df[-4],
do.call(rbind, lapply(split(df, df$sp),
multi.func, value_column = "vals")),
by.x = "sp", by.y = "row.names"))
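To see why by.y = "row.names" works in the merge above, it may help to look at the intermediate result. This is only a sketch: pieces, stats and the shortened summary function are illustrative names and are not part of the answer's code.

```r
df <- data.frame(sp = c(100, 100, 100, 101, 101, 101, 102, 102, 102),
                 vals = c(1, 2, 3, 4, 5, 6, 7, 8, 9))

# split() returns a list of data frames named after the levels of sp
pieces <- split(df, df$sp)
names(pieces)
# [1] "100" "101" "102"

# Each piece is reduced to a one-row data frame; rbind() then uses
# the list names as row names of the combined result
stats <- do.call(rbind, lapply(pieces, function(d) {
  data.frame(n = nrow(d), mean = mean(d$vals))
}))
rownames(stats)
# [1] "100" "101" "102"
```

Because the group labels end up in the row names rather than in a column, merge has to be told to match sp against "row.names".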
Answer 1: (score: 2)
Using only aggregate:
result <- aggregate(vals ~ type + sp + country, df,
function(x) c(length(x), min(x), max(x), mean(x))
)
result
type sp country vals.1 vals.2 vals.3 vals.4
1 C 100 A 3 1 3 2
2 H 101 B 3 4 6 5
3 C 102 C 3 7 9 8
colnames(result)
[1] "type" "sp" "country" "vals"
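The above appears to create an odd "multi-valued" column. However, summaryBy from the doBy package is similar to aggregate but allows output containing multiple columns: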
library(doBy)
result <- summaryBy(vals ~ type + sp + country, df,
FUN=function(x) c(n=length(x), min=min(x), max=max(x), mean=mean(x))
)
result
type sp country vals.n vals.min vals.max vals.mean
1 C 100 A 3 1 3 2
2 C 102 C 3 7 9 8
3 H 101 B 3 4 6 5
colnames(result)
[1] "type" "sp" "country" "vals.n" "vals.min" "vals.max"
[7] "vals.mean"
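If you would rather not add a package, the matrix column produced by the plain aggregate call can also be flattened in base R. A minimal sketch, assuming the summary function returns a named vector (flat is just an illustrative name):

```r
df <- data.frame(sp = c(100, 100, 100, 101, 101, 101, 102, 102, 102),
                 type = c("C", "C", "C", "H", "H", "H", "C", "C", "C"),
                 country = c("A", "A", "A", "B", "B", "B", "C", "C", "C"),
                 vals = c(1, 2, 3, 4, 5, 6, 7, 8, 9))

# Naming the elements gives the matrix column meaningful column names
result <- aggregate(vals ~ type + sp + country, df,
                    function(x) c(n = length(x), min = min(x),
                                  max = max(x), mean = mean(x)))

# Passing the columns to data.frame() expands the matrix column
# into ordinary columns
flat <- do.call(data.frame, result)
colnames(flat)
# [1] "type"      "sp"        "country"   "vals.n"    "vals.min"  "vals.max"
# [7] "vals.mean"
```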