Summarizing data that is already grouped in R

Time: 2013-08-09 17:30:47

Tags: r reshape summary

I am working with the following dataset in R (ID = CUSTID):

ID Geo Channel Brand Neworstream RevQ112 RevQ212 RevQ312
1  NA  On-line  1      New         5         0       1
1  NA  On-line  1      Stream      5         0       1
3  EU  Tele     2       Stream     5         1       0

I would like to convert the dataset so that it has columns in this format:

ID Geo Brand Neworstream OnlineRevQ112 TeleRevQ112 OnlineRevQ212 TeleRevQ212

What is the best way to do this? I can't work out which R command is best suited to it.

Thanks in advance.

2 Answers:

Answer 0: (score: 4)

You can restructure the data with the reshape2 package and its melt and dcast functions.

data <- structure(list(ID = c(1L, 1L, 3L), Geo = structure(c(NA, NA, 
1L), .Label = "EU", class = "factor"), Channel = structure(c(1L, 
1L, 2L), .Label = c("On-line", "Tele"), class = "factor"), Brand = c(1L, 
1L, 2L), Neworstream = structure(c(1L, 2L, 2L), .Label = c("New", 
"Stream"), class = "factor"), RevQ112 = c(5L, 5L, 5L), RevQ212 = c(0L, 
0L, 1L), RevQ312 = c(1L, 1L, 0L)), .Names = c("ID", "Geo", "Channel", 
"Brand", "Neworstream", "RevQ112", "RevQ212", "RevQ312"), class = "data.frame", row.names = c(NA, 
-3L)) 

library(reshape2)
## melt data
df_long<-melt(data,id.vars=c("ID","Geo","Channel","Brand","Neworstream"))

## recast in combinations of channel and time frame
dcast(df_long,... ~Channel+variable,sum)
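The columns that dcast generates here join the channel and the variable name with an underscore (for example On-line_RevQ112), while the question asks for names such as OnlineRevQ112. The following is a minimal sketch, not part of the original answer, of one way to get closer to that layout. It assumes reshape2 is installed, reads "NA" in Geo as a literal string, and uses a gsub pattern that is purely an illustrative choice.

```r
library(reshape2)

# Read "NA" in Geo as a literal string (e.g. North America), not a missing value
mydf <- read.table(header = TRUE, na.strings = "", stringsAsFactors = FALSE,
  text = "ID Geo Channel Brand Neworstream RevQ112 RevQ212 RevQ312
1  NA  On-line  1      New         5         0       1
1  NA  On-line  1      Stream      5         0       1
3  EU  Tele     2       Stream     5         1       0")

# Long format: one row per ID combination, channel and revenue column
df_long <- melt(mydf, id.vars = c("ID", "Geo", "Channel", "Brand", "Neworstream"))

# Wide format: one column per channel/quarter combination
wide <- dcast(df_long, ID + Geo + Brand + Neworstream ~ Channel + variable,
              fun.aggregate = sum, value.var = "value")

# Assumption: only the generated column names contain "-" or "_"
names(wide) <- gsub("[-_]", "", names(wide))
names(wide)
```

Because sum is used as the aggregation function, channel and quarter combinations with no rows are filled with 0 rather than NA, which may or may not be what you want.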

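Answer 1: (score: 2)

Update / facepalm

The "NA" in your dataset is probably not a missing value but the abbreviation "NA" for North America, or something similar.

If you set na.strings when reading the data in, you should have no problem using reshape as I originally suggested: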

mydf <- read.table(header = TRUE, na.strings = "", 
text = 'ID Geo Channel Brand Neworstream RevQ112 RevQ212 RevQ312
1  NA  On-line  1      New         5         0       1
1  NA  On-line  1      Stream      5         0       1
3  EU  Tele     2       Stream     5         1       0')

reshape(mydf, direction = "wide",
        idvar = c("ID", "Geo", "Brand", "Neworstream"),
        timevar = "Channel")

(That said, I would suggest changing your abbreviation, both for legibility and to reduce confusion!)
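As a minimal sketch of that suggestion, the abbreviation can be recoded right after the data is read in. The replacement label "NAM" is a hypothetical choice made for illustration; it does not come from the question.

```r
# Read "NA" in Geo as a literal string rather than a missing value
mydf <- read.table(header = TRUE, na.strings = "", stringsAsFactors = FALSE,
  text = "ID Geo Channel Brand Neworstream RevQ112 RevQ212 RevQ312
1  NA  On-line  1      New         5         0       1
1  NA  On-line  1      Stream      5         0       1
3  EU  Tele     2       Stream     5         1       0")

# "NAM" is a hypothetical label that cannot be mistaken for a missing value
mydf$Geo[mydf$Geo == "NA"] <- "NAM"
mydf$Geo
```

After this step the column no longer contains anything that R, or a later read.table call with default settings, would interpret as missing.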


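Original answer (kept because there is still something interesting about reshape)

This should do it. (Judging by the <NA> in the output, mydf was read here with the default na.strings, so "NA" in Geo is treated as a missing value.)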

reshape(mydf, direction = "wide", 
        idvar = c("ID", "Geo", "Brand", "Neworstream"), 
        timevar = "Channel")
#   ID  Geo Brand Neworstream RevQ112.On-line RevQ212.On-line RevQ312.On-line
# 1  1 <NA>     1         New               5               0               1
# 3  3   EU     2      Stream              NA              NA              NA
#   RevQ112.Tele RevQ212.Tele RevQ312.Tele
# 1           NA           NA           NA
# 3            5            1            0

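Update (trying to salvage the answer a bit)

As @Arun pointed out, the above is not entirely correct: the two rows with ID 1 have been collapsed into one. The culprit is interaction(), which reshape() uses to create a new temporary ID variable when more than one ID variable is specified.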

Here is the line from reshape(), and what it looks like when applied to our "mydf" object:

data[, tempidname] <- interaction(data[, idvar], drop = TRUE)
interaction(mydf[c(1, 2, 4, 5)], drop = TRUE)
# [1] <NA>          <NA>          3.EU.2.Stream
# Levels: 3.EU.2.Stream

Hmm. That seems to simplify to just two IDs, 3.EU.2.Stream and NA.

What happens if we replace NA with ""?

mydf$Geo <- as.character(mydf$Geo)
mydf$Geo[is.na(mydf$Geo)] <- ""
interaction(mydf[c(1, 2, 4, 5)], drop = TRUE)
# [1] 1..1.New      1..1.Stream   3.EU.2.Stream
# Levels: 1..1.New 1..1.Stream 3.EU.2.Stream

Aaahh. That's a bit better. We now have three unique IDs, and reshape() seems to work:

reshape(mydf, direction = "wide",
        idvar = names(mydf)[c(1, 2, 4, 5)],
        timevar = "Channel")
#   ID Geo Brand Neworstream RevQ112.On-line RevQ212.On-line
# 1  1         1         New               5               0
# 2  1         1      Stream               5               0
# 3  3  EU     2      Stream              NA              NA
#   RevQ312.On-line RevQ112.Tele RevQ212.Tele RevQ312.Tele
# 1               1           NA           NA           NA
# 2               1           NA           NA           NA
# 3              NA            5            1            0
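If you would rather keep the missing values in Geo untouched, one alternative is to build a single key column yourself and hand that to reshape() as the only idvar, so that interaction() is never applied to several ID columns. This is a sketch under stated assumptions rather than something from the original answer: the column name key is hypothetical, and v.names is set explicitly so that the original ID columns are carried along as constants.

```r
# Default na.strings: "NA" in Geo is read as a missing value
mydf <- read.table(header = TRUE, stringsAsFactors = FALSE,
  text = "ID Geo Channel Brand Neworstream RevQ112 RevQ212 RevQ312
1  NA  On-line  1      New         5         0       1
1  NA  On-line  1      Stream      5         0       1
3  EU  Tele     2       Stream     5         1       0")

id_cols <- c("ID", "Geo", "Brand", "Neworstream")

# paste() turns NA into the string "NA", so every row keeps a usable key
# ("key" is a hypothetical helper column)
mydf$key <- do.call(paste, c(mydf[id_cols], sep = "."))

# A single idvar avoids combining several ID columns inside reshape()
wide <- reshape(mydf, direction = "wide",
                idvar = "key", timevar = "Channel",
                v.names = c("RevQ112", "RevQ212", "RevQ312"))
wide
```

The result keeps all three ID combinations, and Geo still holds real NA values; the helper column can be dropped afterwards with wide$key <- NULL.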