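I have two data.frames. The first is a lookup table that tells me which products belong to each group. Each group has at least one product of Type 1 and one of Type 2.
The second data.frame holds the transaction details. Each transaction can contain one of the following:
a) only Type 1 products from one of the groups
b) only Type 2 products from one of the groups
c) Type 1 and Type 2 products from the same group
For my analysis, I am interested in case c), i.e. how many transactions sold products of Type 1 and Type 2 from the same group. If a transaction contains a Type 1 product and a Type 2 product from different groups, we ignore that transaction altogether.
Thus, the Type 1 and Type 2 products must belong to the same group.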
Here is my lookup table:
> P_Lookup
Group ProductID1 ProductID2
Group1 A 1
Group1 B 2
Group1 B 3
Group2 C 4
Group2 C 5
Group2 C 6
Group3 D 7
Group3 C 8
Group3 C 9
Group4 E 10
Group4 F 11
Group4 G 12
Group5 H 13
Group5 H 14
Group5 H 15
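For example, I will not have product G and product 15 in the same transaction, because they belong to different groups.
Here are the transactions: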
TransactionID ProductID ProductType
a1 A 1
a1 B 1
a1 1 2
a2 C 1
a2 4 2
a2 5 2
a3 D 1
a3 C 1
a3 7 2
a3 8 2
a4 H 1
a5 1 2
a5 2 2
a5 3 2
a5 3 2
a5 1 2
a6 H 1
a6 15 2
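To make case c) concrete, here is a minimal base R sketch of the rule, not the asker's code. It assumes the lookup table has already been reshaped to one row per group/product pair. The `lookup` and `trans` objects are small stand-ins: the Group4 and Group5 rows and transactions a4 and a6 come from the tables above, while transaction `x1` is hypothetical and added only to show a cross-group sale being ignored.

```r
# Long-format lookup: one row per (Group, ProductID) pair.
# Rows copied from Group4 and Group5 of the lookup table above.
lookup <- data.frame(
  Group     = c("Group4", "Group4", "Group5", "Group5"),
  ProductID = c("G", "12", "H", "15"),
  stringsAsFactors = FALSE
)

# a4 and a6 are taken from the transactions above; x1 is a hypothetical
# cross-group transaction (G with 15) added only for illustration.
trans <- data.frame(
  TransactionID = c("a4", "a6", "a6", "x1", "x1"),
  ProductID     = c("H",  "H",  "15", "G",  "15"),
  ProductType   = c(1, 1, 2, 1, 2),
  stringsAsFactors = FALSE
)

# Attach the group to every transaction row.
joined <- merge(trans, lookup, by = "ProductID")

# Count the distinct product types within each transaction/group pair.
n_types <- aggregate(
  ProductType ~ TransactionID + Group,
  data = joined,
  FUN = function(x) length(unique(x))
)

# Case c): both types occur within one group of one transaction.
case_c <- n_types[n_types$ProductType == 2, c("TransactionID", "Group")]
case_c
```

Only a6 should survive here: a4 has a single product type, and x1 has both types but split across Group4 and Group5.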
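My code:
So far I have been able to write code using dplyr that shortlists transactions for one group. However, I am not sure how to vectorize the code across all groups.
Here is my code: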
P_Groups<-unique(P_Lookup$Group)
Chosen_Group<-P_Groups[5]
P_Group_Ind <- P_Trans %>%
  group_by(TransactionID) %>%
  dplyr::filter((ProductID %in% unique(P_Lookup[P_Lookup$Group == Chosen_Group, ]$ProductID1)) |
                (ProductID %in% unique(P_Lookup[P_Lookup$Group == Chosen_Group, ]$ProductID2))) %>%
  mutate(No_of_PIDs = n_distinct(ProductType)) %>%
  mutate(Group_Name = Chosen_Group)
P_Group_Ind<-P_Group_Ind[P_Group_Ind$No_of_PIDs>1,]
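This works fine as long as I pick each group manually, i.e. by setting Chosen_Group. However, I am not sure how to automate this. One way, I suppose, is a for loop, but I know the beauty of R is vectorization, so I would like to stay away from for loops.
I would sincerely appreciate any help. I have spent almost two days on this. I looked at "using dplyr in for loop in r", but that thread seems to be about a different issue.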
Data:
Here is the dput of P_Trans:
structure(list(TransactionID = c("a1", "a1", "a1", "a2", "a2",
"a2", "a3", "a3", "a3", "a3", "a4", "a5", "a5", "a5", "a5", "a5",
"a6", "a6"), ProductID = c("A", "B", "1", "C", "4", "5", "D",
"C", "7", "8", "H", "1", "2", "3", "3", "1", "H", "15"), ProductType = c(1,
1, 2, 1, 2, 2, 1, 1, 2, 2, 1, 2, 2, 2, 2, 2, 1, 2)), .Names = c("TransactionID",
"ProductID", "ProductType"), row.names = c(NA, 18L), class = "data.frame")
Here is the dput of P_Lookup:
structure(list(Group = c("Group1", "Group1", "Group1", "Group2",
"Group2", "Group2", "Group3", "Group3", "Group3", "Group4", "Group4",
"Group4", "Group5", "Group5", "Group5"), ProductID1 = c("A",
"B", "B", "C", "C", "C", "D", "C", "C", "E", "F", "G", "H", "H",
"H"), ProductID2 = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13,
14, 15)), .Names = c("Group", "ProductID1", "ProductID2"), row.names = c(NA,
15L), class = "data.frame")
Here is the dput() after adding products to P_Trans that are not present in the lookup table:
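Answer 0 (score: 4)
Here is a tidyverse (dplyr, tidyr, and purrr) solution that I hope will help.
Note that using map_df in the last line returns all results as a single data frame. If you would prefer a list with one element per group, just use map instead.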
library(dplyr)
library(tidyr)
library(purrr)
# Save unique groups for later use
P_Groups <- unique(P_Lookup$Group)
# Convert lookup table to product IDs and Groups
P_Lookup <- P_Lookup %>%
  gather(ProductIDn, ProductID, ProductID1, ProductID2) %>%
  select(ProductID, Group) %>%
  distinct() %>%
  nest(-ProductID, .key = Group)
# Bind Group information to transactions
# and group for next analysis
P_Trans <- P_Trans %>%
  left_join(P_Lookup) %>%
  filter(!map_lgl(Group, is.null)) %>%
  unnest(Group) %>%
  group_by(TransactionID)
# Iterate through Groups to produce results
map(P_Groups, ~ filter(P_Trans, Group == .)) %>%
  map(~ mutate(., No_of_PIDs = n_distinct(ProductType))) %>%
  map_df(~ filter(., No_of_PIDs > 1))
#> Source: local data frame [12 x 5]
#> Groups: TransactionID [4]
#>
#> TransactionID ProductID ProductType Group No_of_PIDs
#> <chr> <chr> <dbl> <chr> <int>
#> 1 a1 A 1 Group1 2
#> 2 a1 B 1 Group1 2
#> 3 a1 1 2 Group1 2
#> 4 a2 C 1 Group2 2
#> 5 a2 4 2 Group2 2
#> 6 a2 5 2 Group2 2
#> 7 a3 D 1 Group3 2
#> 8 a3 C 1 Group3 2
#> 9 a3 7 2 Group3 2
#> 10 a3 8 2 Group3 2
#> 11 a6 H 1 Group5 2
#> 12 a6 15 2 Group5 2
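The difference between map and map_df mentioned above can be seen on a toy input. This is only a sketch: `pieces` is a hypothetical stand-in for the per-group data frames, not output from the code above. Note that map_df needs dplyr installed to bind rows, and newer purrr releases mark it as superseded in favour of `map()` followed by `list_rbind()`, although it still works.

```r
library(purrr)

# Hypothetical per-group results, standing in for the filtered data frames.
pieces <- list(
  data.frame(TransactionID = "a1", Group = "Group1", stringsAsFactors = FALSE),
  data.frame(TransactionID = "a2", Group = "Group2", stringsAsFactors = FALSE)
)

# map() keeps one list element per input element.
as_list <- map(pieces, ~ .x)

# map_df() binds the per-element results into a single data frame.
as_frame <- map_df(pieces, ~ .x)

length(as_list)  # number of list elements
nrow(as_frame)   # number of rows after binding
```

Keeping the list form is convenient when each group needs further separate processing; the single data frame is easier to count or export.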
Answer 1 (score: 2)
Here is a single-pipe dplyr solution:
P_DualGroupTransactionsCount <-
  P_Lookup %>% # data needing single column map of Keys
  gather(IDnum, ProductID, ProductID1:ProductID2) %>% # produce long single map of Keys for GroupID (tidyr::)
  right_join(P_Trans) %>% # join transactions to groupID info
  group_by(TransactionID, Group) %>% # organize for same transaction & same group
  mutate(DualGroup = ifelse(n_distinct(ProductType)==2, T, F)) %>% # flag transaction/group pairs that contain both product types
  filter(DualGroup == T) %>% # choose only doubles
  select(TransactionID, Group) %>% # remove excess columns
  distinct %>% # remove excess rows
  nrow # count of unique transaction ID's
# P_DualGroupTransactions
# Source: local data frame [4 x 2]
# Groups: TransactionID, Group [4]
#
# TransactionID Group
# <chr> <chr>
# 1 a1 Group1
# 2 a2 Group2
# 3 a3 Group3
# 4 a6 Group5
# P_DualGroupTransactionsCount
[1] 4
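For readers on a recent tidyr, where gather is superseded by pivot_longer, the same idea can be sketched as follows. This is a variation under stated assumptions, not the answer's own code. The data frames are rebuilt inline from the dput output in the question. `lookup_long` and `dual` are names introduced here. ProductID2 is converted to character first because pivot_longer will not combine a character column with a numeric one. An inner join is swapped in for the right join so that transaction rows without a lookup match are dropped.

```r
library(dplyr)
library(tidyr)

# Data rebuilt from the dput() output in the question.
P_Lookup <- data.frame(
  Group = rep(paste0("Group", 1:5), each = 3),
  ProductID1 = c("A", "B", "B", "C", "C", "C", "D", "C", "C",
                 "E", "F", "G", "H", "H", "H"),
  ProductID2 = 1:15,
  stringsAsFactors = FALSE
)
P_Trans <- data.frame(
  TransactionID = c("a1", "a1", "a1", "a2", "a2", "a2", "a3", "a3", "a3",
                    "a3", "a4", "a5", "a5", "a5", "a5", "a5", "a6", "a6"),
  ProductID = c("A", "B", "1", "C", "4", "5", "D", "C", "7", "8", "H",
                "1", "2", "3", "3", "1", "H", "15"),
  ProductType = c(1, 1, 2, 1, 2, 2, 1, 1, 2, 2, 1, 2, 2, 2, 2, 2, 1, 2),
  stringsAsFactors = FALSE
)

# One row per distinct (Group, ProductID) pair.
lookup_long <- P_Lookup %>%
  mutate(ProductID2 = as.character(ProductID2)) %>%
  pivot_longer(c(ProductID1, ProductID2),
               names_to = "IDnum", values_to = "ProductID") %>%
  distinct(Group, ProductID)

# Transaction/group pairs that contain both product types.
# Recent dplyr versions may warn about a many-to-many join here,
# because product C belongs to both Group2 and Group3.
dual <- P_Trans %>%
  inner_join(lookup_long, by = "ProductID") %>%
  group_by(TransactionID, Group) %>%
  summarise(n_types = n_distinct(ProductType), .groups = "drop") %>%
  filter(n_types == 2)

nrow(dual)  # number of qualifying transaction/group pairs
```

On the sample data this should select the same four transaction/group pairs as the output listed above.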