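This post is a continuation of my previous post, Join then mutate using data.table without intermediate table.
In that thread, I used a lookup table to change Revenue and Quantity and then divided the result by .N, so that I don't see inflated values when I aggregate by product.
Following the advice of the experts in that thread, I don't want to count over all four variables used for the join, i.e. PO_ID, SO_ID, F_Year and Product_ID, but only over SO_ID, F_Year and Product_ID.
Question: how can I do this using data.table?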
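To make the "divide by .N" idea concrete, here is a minimal sketch on toy data. The tables `detail` and `lookup` and their values are hypothetical and are not part of the data in this post; they only illustrate why a joined lookup value gets inflated and how dividing by the group size undoes that.

```r
library(data.table)

# Hypothetical toy data: one lookup value of 50 that matches two detail rows
detail <- data.table(SO_ID = c("S1", "S1"), Location1 = c("MA", "NY"))
lookup <- data.table(SO_ID = "S1", Revenue = 50)

# The join repeats the lookup value on every matching detail row
joined <- lookup[detail, on = "SO_ID"]
inflated_total <- sum(joined$Revenue)   # the value 50 is counted twice here

# Dividing by the group size .N spreads the value evenly over the rows
joined[, Revenue := Revenue / .N, by = SO_ID]
corrected_total <- sum(joined$Revenue)  # adds up to the lookup value again
```

Which columns go into `by` decides what counts as one group, which is exactly what the question is about.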
Here are my data and code, using dplyr.

Input:
DFI = structure(list(PO_ID = c("P1234", "P1234", "P1234", "P1234",
"P1234", "P1234", "P2345", "P2345", "P3456", "P4567"), SO_ID = c("S1",
"S1", "S2", "S2", "S2", "S2", "S3", "S3", "S7", "S10"), F_Year = c(2012,
2012, 2013, 2013, 2013, 2013, 2011, 2011, 2014, 2015), Product_ID = c("385X",
"385X", "450X", "450X", "450X", "900X", "3700", "3700", "A11U",
"2700"), Revenue = c(1, 2, 3, 34, 34, 6, 7, 88, 9, 100), Quantity = c(1,
2, 3, 8, 8, 6, 7, 8, 9, 40), Location1 = c("MA", "NY", "WA",
"NY", "WA", "NY", "IL", "IL", "MN", "CA")), .Names = c("PO_ID",
"SO_ID", "F_Year", "Product_ID", "Revenue", "Quantity", "Location1"
), row.names = c(NA, 10L), class = "data.frame")

Lookup table:
DF_Lookup = structure(list(PO_ID = c("P1234", "P1234", "P1234", "P2345",
"P2345", "P3456", "P4567"), SO_ID = c("S1", "S2", "S2", "S3",
"S4", "S7", "S10"), F_Year = c(2012, 2013, 2013, 2011, 2011,
2014, 2015), Product_ID = c("385X", "450X", "900X", "3700", "3700",
"A11U", "2700"), Revenue = c(50, 70, 35, 100, -50, 50, 100),
Quantity = c(3, 20, 20, 20, -10, 20, 40)), .Names = c("PO_ID",
"SO_ID", "F_Year", "Product_ID", "Revenue", "Quantity"), row.names = c(NA,
7L), class = "data.frame")
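
Here is my modified code using dplyr:
library(dplyr)

DF_Generated <- DFI %>%
  left_join(DF_Lookup, by = c("PO_ID", "SO_ID", "F_Year", "Product_ID")) %>%
  # count rows per SO_ID / F_Year / Product_ID only (PO_ID is left out on purpose)
  dplyr::group_by(SO_ID, F_Year, Product_ID) %>%
  dplyr::mutate(Count = n()) %>%
  dplyr::ungroup() %>%
  # spread the lookup values evenly over the rows of each group
  dplyr::mutate(Revenue = Revenue.y / Count, Quantity = Quantity.y / Count) %>%
  dplyr::select(PO_ID:Product_ID, Location1, Revenue, Quantity)

Note that the input to group_by has changed.
Expected output: DF_Generated, as produced by the dplyr code above.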
Note: I don't want to create intermediate variables, because the actual data is so large that this may not be feasible.

Answer 0 (score: 1)
This should do what you're looking for:
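library(data.table)

setDT(DFI)
# drop the original Revenue and Quantity (by reference) so the lookup values replace them
DFI[ , c("Revenue", "Quantity") := NULL]
setDT(DF_Lookup)
# inner join on all four key columns
dat = merge(DF_Lookup, DFI, by = c("PO_ID", "SO_ID", "F_Year", "Product_ID"))
# divide each lookup value by the number of rows in its group
dat = dat[ , .(Revenue = Revenue/.N, Quantity = Quantity/.N, Location1), by = .(PO_ID, SO_ID, F_Year, Product_ID)]
dat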
    PO_ID SO_ID F_Year Product_ID   Revenue  Quantity Location1
 1: P1234    S1   2012       385X  25.00000  1.500000        MA
 2: P1234    S1   2012       385X  25.00000  1.500000        NY
 3: P1234    S2   2013       450X  23.33333  6.666667        WA
 4: P1234    S2   2013       450X  23.33333  6.666667        NY
 5: P1234    S2   2013       450X  23.33333  6.666667        WA
 6: P1234    S2   2013       900X  35.00000 20.000000        NY
 7: P2345    S3   2011       3700  50.00000 10.000000        IL
 8: P2345    S3   2011       3700  50.00000 10.000000        IL
 9: P3456    S7   2014       A11U  50.00000 20.000000        MN
10: P4567   S10   2015       2700 100.00000 40.000000        CA
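
The code above counts over all four join columns. If the count should be taken over SO_ID, F_Year and Product_ID only, as the question asks, one possible variant is an update join followed by a grouped division, both done by reference so that no intermediate table is created. This is only a sketch and not part of the original answer. It uses a shortened, hypothetical subset of the question's data (the P1234 rows) to stay self-contained, and it assumes every row of DFI finds a match in DF_Lookup: unmatched rows would keep their original values here, whereas the dplyr left_join would give NA.

```r
library(data.table)

# Shortened stand-in for the question's tables (P1234 rows only)
DFI <- data.table(
  PO_ID      = rep("P1234", 6),
  SO_ID      = c("S1", "S1", "S2", "S2", "S2", "S2"),
  F_Year     = c(2012, 2012, 2013, 2013, 2013, 2013),
  Product_ID = c("385X", "385X", "450X", "450X", "450X", "900X"),
  Revenue    = c(1, 2, 3, 34, 34, 6),
  Quantity   = c(1, 2, 3, 8, 8, 6),
  Location1  = c("MA", "NY", "WA", "NY", "WA", "NY")
)
DF_Lookup <- data.table(
  PO_ID      = rep("P1234", 3),
  SO_ID      = c("S1", "S2", "S2"),
  F_Year     = c(2012, 2013, 2013),
  Product_ID = c("385X", "450X", "900X"),
  Revenue    = c(50, 70, 35),
  Quantity   = c(3, 20, 20)
)

# Step 1: update join on all four keys, overwriting Revenue and Quantity
# in DFI with the lookup values (i. refers to columns of DF_Lookup)
DFI[DF_Lookup,
    on = c("PO_ID", "SO_ID", "F_Year", "Product_ID"),
    `:=`(Revenue = i.Revenue, Quantity = i.Quantity)]

# Step 2: divide by the row count of each SO_ID / F_Year / Product_ID group;
# PO_ID is deliberately not part of the grouping
DFI[, `:=`(Revenue = Revenue / .N, Quantity = Quantity / .N),
    by = .(SO_ID, F_Year, Product_ID)]

DFI
```

On this data the result matches the first six rows of the output shown above, because each SO_ID / F_Year / Product_ID combination belongs to a single PO_ID. The two groupings would only differ where one such combination spans several PO_IDs.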