I have a data frame like the following:
set.seed(42)
data <- runif(1000)
utility <- sample(c("abc","bcd","cde","def"),1000,replace=TRUE)
stage <- sample(c("vwx","wxy","xyz"),1000,replace=TRUE)
x <- data.frame(data,utility,stage)
head(x)
data utility stage
1 0.9148060 def xyz
2 0.9370754 abc wxy
3 0.2861395 def xyz
4 0.8304476 cde xyz
5 0.6417455 bcd xyz
6 0.5190959 abc xyz
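I want to generate a cumulative distribution function (CDF) for each unique combination of utility and stage. In my real application I will end up generating about 100 CDFs, but this random data has 12 (4 x 3) unique combinations. I will use each of these CDFs thousands of times, so I don't want to recompute the CDF on every call. The ecdf() function does exactly what I want, except that I need to vectorize it. The following code does not work, but it shows the gist of what I am trying to do: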
ecdf_multiple <- function(x)
{
  i=0
  utilities <- levels(x$utilities)
  stages <- levels(x$stages)
  for(utility in utilities)
  {
    for(stage in stages)
    {
      i <- i + 1
      y <- ecdf(x[x$utilities == utility & x$stage == stage,1])
      # calculate ecdf for the unique util/stage combo
      z[i] <- list(y,utility,stage)
      # then assign it to a data element (list, data frame, json, whatever) note-this doesn't actually work
    }
  }
  z # return value
}
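So after running ecdf_multiple and assigning the result to a variable, I would query that variable by passing in a value (the point at which I want the CDF), a utility, and a stage.
Is there a way to vectorize the ecdf function (or to use or build another one) so that I can evaluate it many times without regenerating the distributions each time?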
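For reference, here is a minimal sketch of how the loop above might be made to run. It is only one possible reading of the intent: it assumes the columns are named `utility` and `stage` (as in the data frame, rather than `utilities` and `stages`), and the `"utility.stage"` key format for the returned list is an arbitrary choice made for this sketch.

```r
# Sketch: build one ecdf per utility/stage combination and keep them in a named list.
set.seed(42)
x <- data.frame(
  data    = runif(1000),
  utility = sample(c("abc", "bcd", "cde", "def"), 1000, replace = TRUE),
  stage   = sample(c("vwx", "wxy", "xyz"), 1000, replace = TRUE)
)

ecdf_multiple <- function(x) {
  z <- list()
  # unique(as.character(...)) works whether the columns are factors or character vectors
  for (utility in unique(as.character(x$utility))) {
    for (stage in unique(as.character(x$stage))) {
      rows <- x$utility == utility & x$stage == stage
      if (!any(rows)) next  # ecdf() cannot be built from an empty vector
      # key format "utility.stage" is an assumption of this sketch
      z[[paste(utility, stage, sep = ".")]] <- ecdf(x$data[rows])
    }
  }
  z  # named list of ecdf functions
}

cdfs <- ecdf_multiple(x)           # computed once
cdfs[["abc.xyz"]](0.25)            # evaluate at a single value
cdfs[["abc.xyz"]](c(0.1, 0.5, 0.9))  # the function returned by ecdf() accepts a vector
```

Each element of `cdfs` is a function, so the distributions are built once and can then be evaluated as often as needed.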
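------- Added in response to @Pascal's excellent suggestion -------
How can this be extended to the more general case of "n" categorical dimensions? Here is my attempt, based on Pascal's two-dimensional case. Note how I try to assign "y":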
set.seed(42)
data <- runif(1000)
utility <- sample(c("abc","bcd","cde","def"),1000,replace=TRUE)
stage <- sample(c("vwx","wxy","xyz"),1000,replace=TRUE)
openclose <- sample(c("open","close"),1000,replace=TRUE)
x <- data.frame(data,utility,stage,openclose)
numlabels <- length(names(x))-1
y <- split(x, list(x[,2:(numlabels+1)]))
l <- lapply(y,function(x) ecdf(x[,"data"]))
#execute
utility <- "abc"
stage <- "xyz"
openclose <- "close"
comb <- paste(utility, stage, openclose, sep = ".")
# call the function
l[[comb]](.25)
During the assignment to "y" above, I get the following error message:
"Error in sort.list(y) : 'x' must be atomic for 'sort.list'
Have you called 'sort' on a list?"
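As far as I can tell, the error comes from wrapping the selected columns in `list()`: that produces a list with a single element which is itself a data frame, and `split()` then tries to turn that whole data frame into one factor. A data frame is already a list of columns, so it can be passed as the grouping argument directly. The sketch below illustrates this under that assumption; `ecdf_by` and its parameter names are hypothetical, introduced here only for illustration.

```r
set.seed(42)
x <- data.frame(
  data      = runif(1000),
  utility   = sample(c("abc", "bcd", "cde", "def"), 1000, replace = TRUE),
  stage     = sample(c("vwx", "wxy", "xyz"), 1000, replace = TRUE),
  openclose = sample(c("open", "close"), 1000, replace = TRUE)
)

# Hypothetical helper: one ecdf per combination of any number of grouping columns.
ecdf_by <- function(df, value_col, group_cols) {
  # df[group_cols] is a data frame, i.e. already a list of grouping vectors,
  # so it is NOT wrapped in list() again.
  # drop = TRUE discards combinations with no rows, for which ecdf() would fail.
  groups <- split(df[[value_col]], df[group_cols], drop = TRUE)
  lapply(groups, ecdf)
}

l <- ecdf_by(x, "data", c("utility", "stage", "openclose"))

# Names are the group values joined with ".", in the order of group_cols
comb <- paste("abc", "xyz", "close", sep = ".")
l[[comb]](0.25)
```

In the original attempt, the equivalent one-line change would be `y <- split(x, x[, 2:(numlabels + 1)])`, leaving the rest of the code as it is.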
Answer 0 (score: 1)