R: summarize a column grouped by other columns, with rollup

Date: 2016-09-16 18:26:09

Tags: r

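I've written the following, which sums a target column in an input dataset and includes partial sums (or rollups, or whatever the preferred vernacular is) over each of the other columns.

This works fine, but it has an undesirable nested for loop that I'd like to remove in favor of a more "functional" approach. I've taken a stab at it, but despite more than a little reading and practice, I'm still in a state of non-grokkery when it comes to the various apply and/or dplyr functions.

It's quite possible I'm going about this all wrong; for example, the setup done for the loop may be unnecessary if the final solution doesn't need it. Basically, I just want to produce the expected output given the provided input.

Anyway, here's the code: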

# dummy data -- assume this is given 
#######################################################################
df1 <- c("AA","B","AA","B","AA","B","AA","B","AA","B","AA","B",
         "M","M","N","N","M","M","N","N","M","M","N","N",
         "X","X","X","X","Y","Y","Y","Y","Z","Z","Z","Z",
         2,3,4,4,2,3,5,4,3,2,5,4)
dim(df1) <- c(12,4)
colnames(df1) <- c("f1","f2","f3","cnt")
df1 <- as.data.frame(df1,stringsAsFactors=F)
df1$cnt <- as.integer(df1$cnt)
#######################################################################
library(data.table)

# some hard-coded variables...
anyStr <- "(any)"       # this string cannot appear in df1
targetColName <- "cnt"  # name of the column being summed from df1
outputColName <- "sum"  # name of our output column

# grab names of only the columns we're going after... (just do everything but the target)
colsToSummarize = (colnames(df1)[!colnames(df1) %in% list(targetColName)])

# create a data table of just the unique values for each of those columns...
df2 <- lapply(colsToSummarize, function(x) { unique(df1[,x])})
df2 <- as.data.table(df2)

# add a dummy row that basically means "any value" ...
# this string cannot otherwise be present in the data...
df2 <- rbind(df2,as.data.table(t(rep(anyStr,length(df2)))))
colnames(df2) <- c(colsToSummarize)

# expand df2 to generate all possible settings found in df1...
df2 <- unique(expand.grid(df2))
rownames(df2)<-NULL

# do all the sums... there's probably a clever way to do this using "apply" functions...
df2[,eval(outputColName)] <- 0
for (i2 in 1:nrow(df2)) {
  for (i1 in 1:nrow(df1)) {
    isMatch = TRUE
    for (j in colsToSummarize) {
      # scalar comparison, so use && and spell out TRUE/FALSE (T and F can be reassigned)
      if ((df2[i2,j]!=anyStr) && (df1[i1,j]!=df2[i2,j])) {
        isMatch = FALSE
        break
      }
    }
    if (isMatch) {
      df2[i2,eval(outputColName)] = df2[i2,eval(outputColName)] + df1[i1,eval(targetColName)]
    }
  }
}

So the sample dummy data looks like this:

> df1
   f1 f2 f3 cnt
1  AA  M  X   2
2   B  M  X   3
3  AA  N  X   4
4   B  N  X   4
5  AA  M  Y   2
6   B  M  Y   3
7  AA  N  Y   5
8   B  N  Y   4
9  AA  M  Z   3
10  B  M  Z   2
11 AA  N  Z   5
12  B  N  Z   4

...and the expected output:

> df2
      f1    f2    f3 sum
1     AA     M     X   2
2      B     M     X   3
3  (any)     M     X   5
4     AA     N     X   4
5      B     N     X   4
6  (any)     N     X   8
7     AA (any)     X   6
8      B (any)     X   7
9  (any) (any)     X  13
10    AA     M     Y   2
11     B     M     Y   3
12 (any)     M     Y   5
13    AA     N     Y   5
14     B     N     Y   4
15 (any)     N     Y   9
16    AA (any)     Y   7
17     B (any)     Y   7
18 (any) (any)     Y  14
19    AA     M     Z   3
20     B     M     Z   2
21 (any)     M     Z   5
22    AA     N     Z   5
23     B     N     Z   4
24 (any)     N     Z   9
25    AA (any)     Z   8
26     B (any)     Z   6
27 (any) (any)     Z  14
28    AA     M (any)   7
29     B     M (any)   8
30 (any)     M (any)  15
31    AA     N (any)  14
32     B     N (any)  12
33 (any)     N (any)  26
34    AA (any) (any)  21
35     B (any) (any)  20
36 (any) (any) (any)  41

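Of course, output that is essentially the same would be fine (e.g. NA or blank or whatever instead of "(any)"; the order of rows and columns doesn't matter; and so on).

As an aside: this is not quite the same as SQL's GROUP BY ... WITH ROLLUP, because it gives every combination of grouped and "(any)" columns rather than the subset determined by the order of the variables in the GROUP BY clause. If someone reading this wants that subset, they just need to drop the rows containing the unwanted "(any)" values.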
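To make that aside concrete, here is a minimal sketch of the filtering step. It assumes the grouping order is f1, f2, f3, and it uses a tiny hand-written stand-in (`cubeRows`, a name made up for this illustration) for a few rows of the full output. Under that assumption, a row belongs to the rollup subset when, reading left to right, "(any)" is never followed by a concrete value.

```r
anyStr <- "(any)"

# Hypothetical stand-in for a few rows of the full output (not the real df2)
cubeRows <- data.frame(
  f1 = c("AA", "AA",    "(any)", "AA",    "(any)"),
  f2 = c("M",  "(any)", "M",     "(any)", "(any)"),
  f3 = c("X",  "X",     "X",     "(any)", "(any)"),
  stringsAsFactors = FALSE
)

# Logical matrix: TRUE where a cell holds the "(any)" placeholder
isAny <- cubeRows[, c("f1", "f2", "f3")] == anyStr

# Keep a row only if its flags never go from TRUE back to FALSE,
# i.e. the flags are sorted (FALSE < TRUE) from left to right
keep <- apply(isAny, 1, function(flags) !is.unsorted(flags))

rollupRows <- cubeRows[keep, ]
rollupRows
```

Only the rows (AA, M, X), (AA, (any), (any)) and ((any), (any), (any)) survive here; a different column order in the GROUP BY clause would need the columns reordered before the check.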

1 answer:

Answer 0 (score: 2)

You can use addmargins() in combination with ftable(). First, build a table that sums cnt within each group:

table1 <- xtabs(cnt ~ f1 + f2 + f3, data = df1)
> table1
, , f3 = X

    f2
f1   M N
  AA 2 4
  B  3 4

, , f3 = Y

    f2
f1   M N
  AA 2 5
  B  3 4

, , f3 = Z

    f2
f1   M N
  AA 3 5
  B  2 4

Then use addmargins() to compute the partial sums:

tablle2 <- addmargins(table1)
> tablle2
, , f3 = X

     f2
f1     M  N Sum
  AA   2  4   6
  B    3  4   7
  Sum  5  8  13

, , f3 = Y

     f2
f1     M  N Sum
  AA   2  5   7
  B    3  4   7
  Sum  5  9  14

, , f3 = Z

     f2
f1     M  N Sum
  AA   3  5   8
  B    2  4   6
  Sum  5  9  14

, , f3 = Sum

     f2
f1     M  N Sum
  AA   7 14  21
  B    8 12  20
  Sum 15 26  41
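If only some of the margins are wanted, addmargins() also takes a margin argument naming the dimensions to sum over. A small sketch, rebuilding the question's data so that it runs on its own (`f3Only` is a name made up here):

```r
# Same dummy data as in the question, built directly as a data frame
df1 <- data.frame(
  f1  = rep(c("AA", "B"), times = 6),
  f2  = rep(rep(c("M", "N"), each = 2), times = 3),
  f3  = rep(c("X", "Y", "Z"), each = 4),
  cnt = c(2L, 3L, 4L, 4L, 2L, 3L, 5L, 4L, 3L, 2L, 5L, 4L),
  stringsAsFactors = FALSE
)

table1 <- xtabs(cnt ~ f1 + f2 + f3, data = df1)

# margin = 3 adds a "Sum" level to the third dimension (f3) only;
# f1 and f2 keep their original levels
f3Only <- addmargins(table1, margin = 3)

dim(f3Only)               # 2 2 4
f3Only["AA", "M", "Sum"]  # 2 + 2 + 3 = 7
```

Calling addmargins() without margin, as in the step above, sums over every dimension and so produces all the combinations asked for in the question.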

Then ftable() puts it into a nice flat form:

table3 <- ftable(tablle2)
> table3
        f3  X  Y  Z Sum
f1  f2                 
AA  M       2  2  3   7
    N       4  5  5  14
    Sum     6  7  8  21
B   M       3  3  2   8
    N       4  4  4  12
    Sum     7  7  6  20
Sum M       5  5  5  15
    N       8  9  9  26
    Sum    13 14 14  41

Finally, as.data.frame() converts it to the shape mentioned in the question:

table4 <- as.data.frame(table3)
> table4
        f1  f2  f3 Freq
    1   AA   M   X    2
    2    B   M   X    3
    3  Sum   M   X    5
    4   AA   N   X    4
    5    B   N   X    4
    6  Sum   N   X    8
    7   AA Sum   X    6
    8    B Sum   X    7
    9  Sum Sum   X   13
    10  AA   M   Y    2
    11   B   M   Y    3
    12 Sum   M   Y    5
    13  AA   N   Y    5
    14   B   N   Y    4
    15 Sum   N   Y    9
    16  AA Sum   Y    7
    17   B Sum   Y    7
    18 Sum Sum   Y   14
    19  AA   M   Z    3
    20   B   M   Z    2
    21 Sum   M   Z    5
    22  AA   N   Z    5
    23   B   N   Z    4
    24 Sum   N   Z    9
    25  AA Sum   Z    8
    26   B Sum   Z    6
    27 Sum Sum   Z   14
    28  AA   M Sum    7
    29   B   M Sum    8
    30 Sum   M Sum   15
    31  AA   N Sum   14
    32   B   N Sum   12
    33 Sum   N Sum   26
    34  AA Sum Sum   21
    35   B Sum Sum   20
    36 Sum Sum Sum   41
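The steps above can be folded into one helper that also renames things to match the question's expected output. This is only a sketch: `sumWithMargins` and its argument names are made up for illustration, it skips the ftable() step (which only affects display), and it assumes that no grouping column already contains the literal value "Sum", since that is the label addmargins() uses for its margins.

```r
# Same dummy data as in the question, built directly as a data frame
df1 <- data.frame(
  f1  = rep(c("AA", "B"), times = 6),
  f2  = rep(rep(c("M", "N"), each = 2), times = 3),
  f3  = rep(c("X", "Y", "Z"), each = 4),
  cnt = c(2L, 3L, 4L, 4L, 2L, 3L, 5L, 4L, 3L, 2L, 5L, 4L),
  stringsAsFactors = FALSE
)

# Hypothetical helper: sums targetColName over every combination of the
# other columns, with anyStr standing for "all values of this column"
sumWithMargins <- function(df, targetColName,
                           outputColName = "sum", anyStr = "(any)") {
  groupCols <- setdiff(colnames(df), targetColName)

  # Build the formula cnt ~ f1 + f2 + f3 from the column names
  f <- reformulate(groupCols, response = targetColName)

  # Cross-tabulate the sums, then add a "Sum" margin on every dimension
  tab <- addmargins(xtabs(f, data = df))

  # Long format: one row per cell, value column named outputColName
  out <- as.data.frame(tab, responseName = outputColName,
                       stringsAsFactors = FALSE)

  # Replace the margin label (assumes "Sum" is not a real data value)
  out[groupCols] <- lapply(out[groupCols],
                           function(col) ifelse(col == "Sum", anyStr, col))
  out
}

result <- sumWithMargins(df1, "cnt")
head(result)
```

Because xtabs() fills in every combination of levels, combinations that never occur in the data show up with a sum of 0 rather than being omitted.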