R data.table: group by multiple columns combined into one column and sum

Time: 2015-04-24 10:43:15

Tags: r group-by data.table

I have the following data.table

> dt = data.table(sales_ccy = c("USD", "EUR", "GBP", "USD"), sales_amt = c(500,600,700,800), cost_ccy = c("GBP","USD","GBP","USD"), cost_amt = c(-100,-200,-300,-400))
> dt
   sales_ccy sales_amt cost_ccy cost_amt
1:       USD       500      GBP     -100
2:       EUR       600      USD     -200
3:       GBP       700      GBP     -300
4:       USD       800      USD     -400

My goal is to get the following data.table

> dt
   ccy total_amt
1: EUR       600
2: GBP       300
3: USD       700

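Basically, I want to sum all costs and sales by currency. In practice this data.table has more than 500,000 rows, so I would like a fast and efficient way to sum these amounts.

Any ideas on how to do this quickly?

5 answers:

Answer 0: (score: 9)

With data.table v1.9.6+, which has an improved version of melt that can melt into multiple columns simultaneously: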

require(data.table) # v1.9.6+
melt(dt, measure = patterns("_ccy$", "_amt$")
    )[, .(tot_amt = sum(value2)), keyby = .(ccy=value1)]
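As a variation on the call above, `melt()` also accepts a `value.name` argument, which can name the two melted columns directly instead of relying on the default `value1`/`value2`. This is a minimal sketch, assuming data.table v1.9.6 or later; the names `ccy`, `amt`, `long` and `res` are illustrative and not taken from the original answer.

```r
library(data.table)  # assumes v1.9.6+ so that patterns() works in measure.vars

dt <- data.table(sales_ccy = c("USD", "EUR", "GBP", "USD"),
                 sales_amt = c(500, 600, 700, 800),
                 cost_ccy  = c("GBP", "USD", "GBP", "USD"),
                 cost_amt  = c(-100, -200, -300, -400))

# Stack the two *_ccy columns into `ccy` and the two *_amt columns into `amt`.
# The column names passed to value.name are illustrative choices.
long <- melt(dt,
             measure.vars = patterns("_ccy$", "_amt$"),
             value.name   = c("ccy", "amt"))

# Sum per currency; keyby also sorts the result by ccy.
res <- long[, .(total_amt = sum(amt)), keyby = ccy]
print(res)
```

The intermediate table `long` has twice as many rows as `dt`, so for very large inputs it may be worth checking memory use before choosing this route.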

Answer 1: (score: 7)

You could consider merged.stack from my "splitstackshape" package.

Here I have also used "dplyr", but you can skip it if you prefer.

library(dplyr)
library(splitstackshape)
dt %>%
  mutate(id = 1:nrow(dt)) %>%
  merged.stack(var.stub = c("ccy", "amt"), sep = "var.stubs", atStart = FALSE) %>%
  .[, .(total_amt = sum(amt)), by = ccy]
#    ccy total_amt
# 1: GBP       300
# 2: USD       700
# 3: EUR       600

The development version of "data.table" should be able to handle melting several columns at once. It is also faster than merged.stack.

Answer 2: (score: 3)

Dirtier than @Pgibas's solution:

dt[,
   list(c(sales_ccy, cost_ccy), c(sum(sales_amt), sum(cost_amt))), # creates two new columns: V1 (ccy) and V2 (amt)
   by=list(sales_ccy, cost_ccy)  # number of rows is reduced to the unique combinations of sales_ccy and cost_ccy
  ][,
    sum(V2), # aggregates the new columns
    by=V1
    ]

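Benchmark

I ran some tests to compare my code with the data.table 1.9.5 solution suggested by Arun.

One observation: I generated the 500K+ rows by repeating the original data.table, which keeps the number of distinct sales_ccy / cost_ccy pairs small. That in turn reduces the number of rows the second data.table [] call has to process (only 8 rows are created in this case).

I don't think that in a real-world scenario the number of rows returned would come anywhere near 500K+ (it is probably on the order of N^2, where N is the number of currencies used, though I have only just looked into this), but keep this in mind when reading the results.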
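The observation above can be checked directly by counting the distinct currency pairs before running the benchmark. This is a small sketch, assuming data.table v1.9.6 or later for `uniqueN()`; the variable names `n_rows`, `n_pairs` and `n_ccy` are illustrative.

```r
library(data.table)  # uniqueN() assumes v1.9.6+

dt <- data.table(sales_ccy = c("USD", "EUR", "GBP", "USD"),
                 sales_amt = c(500, 600, 700, 800),
                 cost_ccy  = c("GBP", "USD", "GBP", "USD"),
                 cost_amt  = c(-100, -200, -300, -400))

# Same doubling as in the benchmark: 4 * 2^17 = 524,288 rows.
for (i in 1:17) dt <- rbind(dt, dt)

n_rows  <- nrow(dt)
n_pairs <- uniqueN(dt, by = c("sales_ccy", "cost_ccy"))  # distinct currency pairs
n_ccy   <- uniqueN(c(dt$sales_ccy, dt$cost_ccy))          # distinct currencies (N)

# The grouped first step emits two rows per distinct pair,
# so the intermediate table has at most 2 * N^2 rows.
cat("rows:", n_rows,
    " pairs:", n_pairs,
    " intermediate rows:", 2 * n_pairs,
    " upper bound:", 2 * n_ccy^2, "\n")
```

If the number of distinct pairs in real data is close to the number of rows, the grouped first step saves little and the timings may look quite different.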

library(data.table)
library(microbenchmark)

rm(dt)
dt <- data.table(sales_ccy = c("USD", "EUR", "GBP", "USD"), sales_amt = c(500,600,700,800), cost_ccy = c("GBP","USD","GBP","USD"), cost_amt = c(-100,-200,-300,-400))
dt


for (i in 1:17) dt <- rbind(dt,dt)

mycode <-function() {
  test1 <- dt[,
              list(c(sales_ccy, cost_ccy),c(sum(sales_amt), sum(cost_amt))), # this will create two new columns with ccy and amt
              keyby=list(sales_ccy, cost_ccy) 
             ][,
                sum(V2), # this will aggregate the new columns
                by=V1
              ]
  rm(test1)
}

suggesteEdit <- function() {

  test2 <- dt[ , .(c(sales_ccy, cost_ccy), c(sales_amt, cost_amt)) # combine cols
   ][, .(tot_amt = sum(V2)), keyby= .(ccy = V1)          # aggregate + reorder
     ]
   rm(test2)
}

meltWithDataTable195 <- function() {
  test3 <- melt(dt, measure = list( c(1,3), c(2,4) ))[, .(tot_amt = sum(value2)), keyby = .(ccy=value1)]
  rm(test3)
}

microbenchmark(
  mycode(),
  suggesteEdit(),
  meltWithDataTable195()
)

Results

Unit: milliseconds
                   expr      min       lq     mean   median       uq      max neval
               mycode() 12.27895 12.47456 15.04098 12.80956 14.73432 45.26173   100
         suggesteEdit() 25.36581 29.56553 42.52952 33.39229 59.72346 69.74819   100
 meltWithDataTable195() 25.71558 30.97693 47.77700 58.68051 61.23996 66.49597   100

Answer 3: (score: 3)

Edited: another way to do this, using aggregate()
df = data.frame(ccy = c(dt$sales_ccy, dt$cost_ccy), total_amt = c(dt$sales_amt, dt$cost_amt))
out= aggregate(total_amt ~ ccy, data = df, sum)
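In the same base R spirit, `rowsum()` can compute the per-currency totals without the formula interface. This is only a sketch of an alternative, not part of the original answer; a plain data.frame stands in for `dt` so that the example needs no packages, and the names `tot` and `out` are illustrative.

```r
# A plain data.frame stands in for dt here (assumption for self-containment);
# the `$` extraction below works the same way on a data.table.
dt <- data.frame(sales_ccy = c("USD", "EUR", "GBP", "USD"),
                 sales_amt = c(500, 600, 700, 800),
                 cost_ccy  = c("GBP", "USD", "GBP", "USD"),
                 cost_amt  = c(-100, -200, -300, -400),
                 stringsAsFactors = FALSE)

# rowsum() sums the values within each group and returns a one-column matrix
# whose row names are the sorted group labels.
tot <- rowsum(c(dt$sales_amt, dt$cost_amt),
              group = c(dt$sales_ccy, dt$cost_ccy))

out <- data.frame(ccy = rownames(tot),
                  total_amt = as.numeric(tot[, 1]),
                  stringsAsFactors = FALSE)
print(out)
```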

Answer 4: (score: 2)

Dirty, but it works

# Bind costs and sales (dt is the data.table from the question)
df <- rbind(dt[,list(ccy = cost_ccy, total_amt = cost_amt)], 
            dt[,list(ccy = sales_ccy, total_amt = sales_amt)])
# Sum for every currency
df[, sum(total_amt), by = ccy]
   ccy  V1
1: GBP 300
2: USD 700
3: EUR 600