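R equivalent of GROUP BY WITH CUBE

Time: 2011-04-20 15:54:01

Tags: r

Some SQL databases support the WITH CUBE modifier for GROUP BY. The one I am using does not have this feature.

Basically, if I have a dataset like this: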
+------+-----------+---------+---------+
| sum  | source_id | type_id | variety |
+------+-----------+---------+---------+
|  491 |         1 |       1 |       1 |
| 2008 |         1 |       2 |       1 |
|   33 |         1 |       3 |       1 |
|  483 |         1 |       4 |       1 |
|  482 |         1 |       5 |       1 |
|  343 |         1 |       6 |       1 |
| 4979 |         4 |       5 |       1 |
|  303 |         5 |       1 |       1 |
|  443 |         5 |       1 |       2 |
| 1295 |         5 |       2 |       1 |
...

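I would like to import it into a data frame in R and generate the sums for every sub-combination of (source_id, type_id, variety). That is: where source_id = 1; where source_id = 1 and type_id = 1; where source_id = 1 and variety = 1; where type_id = 1 and variety = 1; where type_id = 1; where source_id = 2; and so on.

How can I best accomplish this?

3 Answers:

Answer 0: (score: 4)

You can use ddply and feed it a list of the possible column combinations, like this: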

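library(plyr)

facs <- c("source_id", "type_id", "variety")

# every non-empty subset of the grouping columns
combs <- unlist(
  lapply(seq_along(facs), function(j) combn(facs, j, simplify = FALSE)),
  recursive = FALSE
)

# one summary data frame per subset, with the total in a named column
datlist <- lapply(combs, function(j) ddply(Data, j, summarize, SUM = sum(Sum)))

# stack the pieces; columns missing from a subset are filled with NA
# (rbind.fill is exported by plyr, so reshape is not needed)
rbind.fill(datlist)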

Tested with:

Data <- data.frame(
  Sum=rpois(10,5),
  source_id=rep(1:2,each=5),
  type_id=rep(1:5,each=2),
  variety=rep(1:2,5)
)
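For readers without plyr installed, the same idea can be sketched in base R. This is not the code from the answer: it swaps ddply for aggregate() and rbind.fill for a manual NA-padding step. The data values and the helper name cube_piece are hypothetical, chosen only for illustration.

```r
# Hypothetical example data, shaped like the table in the question
Data <- data.frame(
  Sum       = c(491, 2008, 33, 4979, 303, 443),
  source_id = c(1, 1, 1, 4, 5, 5),
  type_id   = c(1, 2, 3, 5, 1, 1),
  variety   = c(1, 1, 1, 1, 1, 2)
)

facs <- c("source_id", "type_id", "variety")

# All non-empty subsets of the grouping columns (7 for three columns)
combs <- unlist(
  lapply(seq_along(facs), function(j) combn(facs, j, simplify = FALSE)),
  recursive = FALSE
)

# Hypothetical helper: sum within one subset of columns, then pad the
# unused grouping columns with NA so every piece has the same layout
cube_piece <- function(vars) {
  out <- aggregate(Data["Sum"], by = Data[vars], FUN = sum)
  for (v in setdiff(facs, vars)) out[[v]] <- NA
  out[c(facs, "Sum")]
}

# Stack all pieces into one data frame
cube <- do.call(rbind, lapply(combs, cube_piece))
print(cube)
```

In the result, an NA in a grouping column means "summed over all values of that column". SQL's WITH CUBE also returns a grand-total row, which neither this sketch nor the answer produces; sum(Data$Sum) would supply it if needed.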

Answer 1: (score: 2)

This should do it:

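# generate dummy data (each id column draws 5 values, recycled to 10 rows)
df <- data.frame(
  Sum       = rnorm(10),
  source_id = sample(10, 5, replace = TRUE),
  type_id   = sample(10, 5, replace = TRUE),
  variety   = sample(10, 5, replace = TRUE)
)

# grouping columns: everything except Sum
index <- names(df)[-1]

# 0/1 masks for every subset of the three columns; [-1, ] drops the empty set
temp <- expand.grid(0:1, 0:1, 0:1)[-1, ]

library(plyr)

# for each mask row, sum within the selected columns and stack the results
cubedf <- adply(temp, 1, function(x)
  ddply(df, index[x == 1], summarize, SUM = sum(Sum)))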
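The expand.grid() call above is the least obvious step, so here is a small base R sketch of what it encodes. Each row of the grid is a 0/1 mask over the three grouping columns, and a 1 means "group by this column". The names masks and subsets are introduced here only for illustration.

```r
index <- c("source_id", "type_id", "variety")

# Each row is a 0/1 mask over the three columns.
# Row 1 of the full grid is all zeros (the empty set), so it is dropped.
masks <- expand.grid(0:1, 0:1, 0:1)[-1, ]

# Translate each mask row into the column names it selects
subsets <- lapply(
  seq_len(nrow(masks)),
  function(i) index[unlist(masks[i, ]) == 1]
)

print(subsets)
```

Three columns give 2^3 - 1 = 7 non-empty subsets, the same set that combn() enumerates in the other answers.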

Edit: an alternative solution (using code borrowed from Joris)

library(plyr)
# list factor variables
index  = names(df)[-1]

# generate all combinations of factor variables
combs  = unlist(llply(1:3, combn, x = index, simplify = F), recursive = F)

# calculate sum for each combination
cubedf = ldply(combs, function(var) 
            ddply(df, var, summarize, SUM = sum(Sum)))

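Answer 2: (score: 1)

Joris's answer is right. But I have to admit that it was not intuitive to me at first glance. Before reading his answer, I would have solved this with multiple ddply() steps, like this: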

Data <- data.frame(
  Sum=rpois(10,5),
  source_id=rep(1:2,each=5),
  type_id=rep(1:5,each=2),
  variety=rep(1:2,5)
)

require(plyr)

myStuff1 <- ddply(Data, c("source_id"                      ), function(df) sum(df$Sum) )
myStuff2 <- ddply(Data, c("source_id", "type_id"           ), function(df) sum(df$Sum) )
myStuff3 <- ddply(Data, c("source_id", "type_id", "variety"), function(df) sum(df$Sum) )
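The three calls above cover only the groupings that start with source_id. As a rough sketch of how the step-by-step approach would extend to the remaining groupings, the lines below use base R's aggregate() formula interface rather than ddply, and fixed Sum values instead of rpois() so the result is reproducible. The object names are hypothetical.

```r
Data <- data.frame(
  Sum       = 1:10,                 # fixed values instead of rpois()
  source_id = rep(1:2, each = 5),
  type_id   = rep(1:5, each = 2),
  variety   = rep(1:2, 5)
)

# The groupings not listed in the answer, one explicit call each
by_type         <- aggregate(Sum ~ type_id,             data = Data, FUN = sum)
by_variety      <- aggregate(Sum ~ variety,             data = Data, FUN = sum)
by_src_variety  <- aggregate(Sum ~ source_id + variety, data = Data, FUN = sum)
by_type_variety <- aggregate(Sum ~ type_id + variety,   data = Data, FUN = sum)

print(by_variety)
```

Writing each grouping out is easy to read for three columns, but the number of calls grows as 2^n - 1, which is why the other answers generate the combinations programmatically.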