Join on 4 variables, then group by fewer variables using data.table

Time: 2017-03-03 21:02:31

Tags: r data.table dplyr

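This post is a continuation of my previous post, "Join then mutate using data.table without intermediate table".

In that thread, I used a lookup table to replace Revenue and Quantity and then divided the result by .N, so that I don't see inflated values when I aggregate by product.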
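To make the inflation problem concrete, here is a minimal sketch on hypothetical toy data (the tables `detail` and `lookup` are made up for illustration and are not the tables from the question). One lookup row matches two detail rows, so the joined value is repeated, and dividing by `.N` within the group restores the original total.

```r
library(data.table)

# Hypothetical toy data: one lookup row matches two detail rows.
detail <- data.table(Product_ID = c("385X", "385X"),
                     Location1  = c("MA", "NY"))
lookup <- data.table(Product_ID = "385X", Revenue = 50)

# Join: the single lookup Revenue is repeated on both detail rows.
joined <- lookup[detail, on = "Product_ID"]
inflated_total <- sum(joined$Revenue)

# Divide by the group size so the group total equals the lookup value again.
joined[, Revenue := Revenue / .N, by = Product_ID]
corrected_total <- sum(joined$Revenue)
```

Here `inflated_total` is twice the lookup value, while `corrected_total` matches it.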

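Following the recommendations of the experts in that thread, I don't want to count over all four variables used for the join (PO_ID, SO_ID, F_Year, Product_ID), but only over SO_ID, F_Year, Product_ID.

Question: How do I do this using data.table?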

Here are my data and my current solution using dplyr.

Input

DFI = structure(list(PO_ID = c("P1234", "P1234", "P1234", "P1234", 
"P1234", "P1234", "P2345", "P2345", "P3456", "P4567"), SO_ID = c("S1", 
"S1", "S2", "S2", "S2", "S2", "S3", "S3", "S7", "S10"), F_Year = c(2012, 
2012, 2013, 2013, 2013, 2013, 2011, 2011, 2014, 2015), Product_ID = c("385X", 
"385X", "450X", "450X", "450X", "900X", "3700", "3700", "A11U", 
"2700"), Revenue = c(1, 2, 3, 34, 34, 6, 7, 88, 9, 100), Quantity = c(1, 
2, 3, 8, 8, 6, 7, 8, 9, 40), Location1 = c("MA", "NY", "WA", 
"NY", "WA", "NY", "IL", "IL", "MN", "CA")), .Names = c("PO_ID", 
"SO_ID", "F_Year", "Product_ID", "Revenue", "Quantity", "Location1"
), row.names = c(NA, 10L), class = "data.frame")

Lookup table

DF_Lookup = structure(list(PO_ID = c("P1234", "P1234", "P1234", "P2345", 
"P2345", "P3456", "P4567"), SO_ID = c("S1", "S2", "S2", "S3", "S4", 
"S7", "S10"), F_Year = c(2012, 2013, 2013, 2011, 2011, 2014, 2015), 
Product_ID = c("385X", "450X", "900X", "3700", "3700", "A11U", "2700"), 
Revenue = c(50, 70, 35, 100, -50, 50, 100), Quantity = c(3, 20, 20, 
20, -10, 20, 40)), .Names = c("PO_ID", "SO_ID", "F_Year", "Product_ID", 
"Revenue", "Quantity"), row.names = c(NA, 7L), class = "data.frame")

Here is my modified code using dplyr:

library(dplyr)

DF_Generated <- DFI %>%
  left_join(DF_Lookup, by = c("PO_ID", "SO_ID", "F_Year", "Product_ID")) %>%
  dplyr::group_by(SO_ID, F_Year, Product_ID) %>%
  dplyr::mutate(Count = n()) %>%
  dplyr::ungroup() %>%
  dplyr::mutate(Revenue = Revenue.y / Count, Quantity = Quantity.y / Count) %>%
  dplyr::select(PO_ID:Product_ID, Location1, Revenue, Quantity)

Please note that the input to group_by has changed.

Expected output:

Note: I don't want to create intermediate variables, because the actual data is so large that this may not be feasible.

1 Answer:

Answer 0: (score: 1)

This should do what you are looking for:

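library(data.table)
setDT(DFI)
# Drop the original Revenue and Quantity so the lookup values take their place
DFI[ , c("Revenue", "Quantity") := NULL]

setDT(DF_Lookup)

# Inner join on all four key columns
dat = merge(DF_Lookup, DFI, by = c("PO_ID", "SO_ID", "F_Year", "Product_ID"))
# Split each lookup value evenly across the rows of its group
dat = dat[ , .(Revenue = Revenue/.N, Quantity = Quantity/.N, Location1), by = .(PO_ID, SO_ID, F_Year, Product_ID)]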

dat
    PO_ID SO_ID F_Year Product_ID   Revenue  Quantity Location1
 1: P1234    S1   2012       385X  25.00000  1.500000        MA
 2: P1234    S1   2012       385X  25.00000  1.500000        NY
 3: P1234    S2   2013       450X  23.33333  6.666667        WA
 4: P1234    S2   2013       450X  23.33333  6.666667        NY
 5: P1234    S2   2013       450X  23.33333  6.666667        WA
 6: P1234    S2   2013       900X  35.00000 20.000000        NY
 7: P2345    S3   2011       3700  50.00000 10.000000        IL
 8: P2345    S3   2011       3700  50.00000 10.000000        IL
 9: P3456    S7   2014       A11U  50.00000 20.000000        MN
10: P4567   S10   2015       2700 100.00000 40.000000        CA
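
The answer above groups by all four join columns and builds a merged table `dat`. As a possible variant (a sketch, not the answerer's method), the same numbers can be obtained by updating `DFI` in place with a data.table update join and then dividing within groups defined only by SO_ID, F_Year and Product_ID, as the question asks. This assumes the lookup table has at most one row per combination of the four keys, which holds for the sample data. Note one difference from `left_join`: rows of `DFI` with no match in `DF_Lookup` would keep their original values here rather than becoming NA; the sample data has no such rows.

```r
library(data.table)

# Sample data from the question, rebuilt as data.tables.
DFI <- data.table(
  PO_ID      = c("P1234", "P1234", "P1234", "P1234", "P1234", "P1234",
                 "P2345", "P2345", "P3456", "P4567"),
  SO_ID      = c("S1", "S1", "S2", "S2", "S2", "S2", "S3", "S3", "S7", "S10"),
  F_Year     = c(2012, 2012, 2013, 2013, 2013, 2013, 2011, 2011, 2014, 2015),
  Product_ID = c("385X", "385X", "450X", "450X", "450X", "900X",
                 "3700", "3700", "A11U", "2700"),
  Revenue    = c(1, 2, 3, 34, 34, 6, 7, 88, 9, 100),
  Quantity   = c(1, 2, 3, 8, 8, 6, 7, 8, 9, 40),
  Location1  = c("MA", "NY", "WA", "NY", "WA", "NY", "IL", "IL", "MN", "CA")
)

DF_Lookup <- data.table(
  PO_ID      = c("P1234", "P1234", "P1234", "P2345", "P2345", "P3456", "P4567"),
  SO_ID      = c("S1", "S2", "S2", "S3", "S4", "S7", "S10"),
  F_Year     = c(2012, 2013, 2013, 2011, 2011, 2014, 2015),
  Product_ID = c("385X", "450X", "900X", "3700", "3700", "A11U", "2700"),
  Revenue    = c(50, 70, 35, 100, -50, 50, 100),
  Quantity   = c(3, 20, 20, 20, -10, 20, 40)
)

# Step 1: update join on all four keys. Matching rows of DFI get the lookup
# values written in place, so no merged copy of the data is created.
DFI[DF_Lookup, on = .(PO_ID, SO_ID, F_Year, Product_ID),
    `:=`(Revenue = i.Revenue, Quantity = i.Quantity)]

# Step 2: divide by the group size, grouping on three columns only.
DFI[, c("Revenue", "Quantity") := .(Revenue / .N, Quantity / .N),
    by = .(SO_ID, F_Year, Product_ID)]
```

Because both steps use `:=`, `DFI` is modified by reference and keeps its original row order, whereas `merge()` returns a new table sorted by the join keys.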