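Return row ratios when aggregating data

Time: 2019-06-18 07:30:05

Tags: r data.table

I have a large data set in R that I am wrangling with data.table. I want to aggregate some data and return, for each row, the ratio of each value to the row total.

I have managed to get most of the way there with dcast, but I cannot work out the last step.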

library(data.table)
tab <- "year  qtr  sales  value
2016  1  A  50
2016  2  A  70
2016  3  A  90
2016  4  A  100
2017  1  A  80
2017  2  A  70
2017  3  A  80
2017  4  A  110
2016  1  B  33
2016  2  B  90
2016  3  B  120
2016  4  B  60
2017  1  B  120
2017  2  B  10
2017  3  B  88
2017  4  B  99
"

dt <- fread(tab)

dcast(dt, sales ~ year, fun.agg = function(x) sum(x), value.var = 'value')

   sales 2016 2017
1:     A  310  340
2:     B  303  317

What I really want is the row ratios (i.e. 310 / (310 + 340), and so on):

   sales  2016  2017
1:     A  0.48  0.52
2:     B  0.49  0.51

How can I do this?

3 answers:

Answer 0 (score: 2)

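Just divide by rowSums (thanks to @Ronak Shah for pointing out that apply is not needed).

dt2[, -1] / rowSums(dt2[, -1])
#         2016      2017
# 1: 0.4769231 0.5230769
# 2: 0.4887097 0.5112903

Then round the result and cbind it back onto the sales column. No transpose is needed, because the division already keeps one row per sales value.

dt2 <- cbind(dt2[, 1], round(dt2[, -1] / rowSums(dt2[, -1]), 2))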
dt2
#    sales 2016 2017
# 1:     A 0.48 0.52
# 2:     B 0.49 0.51
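As an alternative to dividing by rowSums by hand, base R's prop.table with margin = 1 computes the same row-wise proportions on a matrix. The following is only a sketch: the table `wide` is rebuilt here from the totals shown in the question so that the block is self-contained, and the names `wide`, `totals`, `ratios` and `result` are illustrative, not from the original answer.

```r
library(data.table)

# Aggregated totals from the question, rebuilt so the sketch runs on its own
# (`wide` is an illustrative name for what the answer calls dt2).
wide <- data.table(sales = c("A", "B"),
                   `2016` = c(310L, 303L),
                   `2017` = c(340L, 317L))

# Drop the label column and convert the year columns to a numeric matrix.
totals <- as.matrix(wide[, -1])

# margin = 1 divides every cell by the sum of its row.
ratios <- prop.table(totals, margin = 1)

# Re-attach the sales labels and round for display.
result <- cbind(wide[, 1], round(ratios, 2))
result
```

Each row of `ratios` sums to 1, which is a quick sanity check that the proportions were taken across years rather than across sales groups.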

The best approach here is probably to stay within data.table and do it in a single step, as @chinsoon12 pointed out in the comments.

dt2 <- dcast(dt[, x := round(value / sum(value), 2), by=.(sales)], sales ~ year, sum, value.var='x')
dt2
#    sales 2016 2017
# 1:     A 0.48 0.52
# 2:     B 0.49 0.51
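Two details of the one-step version are worth keeping in mind: `:=` adds the column x to dt by reference, and each quarterly share is rounded before the shares are summed, so rounding error can accumulate on other data. The sketch below is one possible variant, not the answerer's code: it works on a copy() and rounds only once, after aggregation. The names `tmp`, `share`, `wide` and `yr_cols` are assumptions made for illustration.

```r
library(data.table)

# Sample data from the question, rebuilt so the sketch is self-contained.
dt <- data.table(
  year  = rep(rep(c(2016L, 2017L), each = 4), times = 2),
  qtr   = rep(1:4, times = 4),
  sales = rep(c("A", "B"), each = 8),
  value = c(50, 70, 90, 100, 80, 70, 80, 110,
            33, 90, 120, 60, 120, 10, 88, 99)
)

# copy() keeps the original dt untouched; `share` is each row's unrounded
# fraction of its sales group's total.
tmp <- copy(dt)[, share := value / sum(value), by = sales]

# Sum the unrounded shares per sales/year cell.
wide <- dcast(tmp, sales ~ year, fun.aggregate = sum, value.var = "share")

# Round once, at the end, on the year columns only.
yr_cols <- setdiff(names(wide), "sales")
wide[, (yr_cols) := lapply(.SD, round, 2), .SDcols = yr_cols]
wide
```

On this data the result is the same as above; the difference only shows up when many small shares are rounded individually.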

Data

dt <- structure(list(
  year = c(2016L, 2016L, 2016L, 2016L, 2017L, 2017L, 2017L, 2017L,
           2016L, 2016L, 2016L, 2016L, 2017L, 2017L, 2017L, 2017L),
  qtr = c(1L, 2L, 3L, 4L, 1L, 2L, 3L, 4L, 1L, 2L, 3L, 4L, 1L, 2L, 3L, 4L),
  sales = c("A", "A", "A", "A", "A", "A", "A", "A",
            "B", "B", "B", "B", "B", "B", "B", "B"),
  value = c(50L, 70L, 90L, 100L, 80L, 70L, 80L, 110L,
            33L, 90L, 120L, 60L, 120L, 10L, 88L, 99L)),
  row.names = c(NA, -16L), class = c("data.table", "data.frame"))
dt2 <- dcast(dt, sales ~ year, fun.agg = function(x) sum(x), value.var = 'value')

Answer 1 (score: 2)

Another straightforward data.table solution:

dt[, .(tmp = sum(value)), by = .(year, sales)
   ][, .(value = tmp / sum(tmp), year), by = .(sales)  # ratio within each sales row
     ][, dcast(.SD, sales ~ year, value.var = "value")]

#    sales      2016      2017
# 1:     A 0.4769231 0.5230769
# 2:     B 0.4887097 0.5112903

Answer 2 (score: 1)

Using tidyverse, we can group_by sales and year, sum the values, compute each year's ratio within its sales group, and spread the result to wide format.

library(tidyverse)

dt %>%
  group_by(sales, year) %>%
  summarise(value = sum(value)) %>%
  mutate(value = value/sum(value)) %>%
  spread(year, value)

#  sales `2016` `2017`
#  <chr>  <dbl>  <dbl>
#1 A      0.477  0.523
#2 B      0.489  0.511
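In tidyr 1.0.0 and later, spread is superseded by pivot_wider, so the same pipeline can be written as in the sketch below. It assumes dplyr 1.0 or later for the `.groups` argument of summarise, and it rebuilds the sample data as a plain data frame so that the block stands on its own; the name `res` is illustrative.

```r
library(dplyr)
library(tidyr)

# Sample data from the question, rebuilt so the sketch is self-contained.
dt <- data.frame(
  year  = rep(rep(c(2016L, 2017L), each = 4), times = 2),
  qtr   = rep(1:4, times = 4),
  sales = rep(c("A", "B"), each = 8),
  value = c(50, 70, 90, 100, 80, 70, 80, 110,
            33, 90, 120, 60, 120, 10, 88, 99)
)

res <- dt %>%
  group_by(sales, year) %>%
  # "drop_last" leaves the result grouped by sales only.
  summarise(value = sum(value), .groups = "drop_last") %>%
  # Within each sales group, divide each yearly total by the group total.
  mutate(value = value / sum(value)) %>%
  ungroup() %>%
  # One column per year, one row per sales value.
  pivot_wider(names_from = year, values_from = value)
res
```

Making the grouping explicit with `.groups = "drop_last"` documents why the mutate step divides by the per-sales total rather than the grand total.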