Finding pairs

Time: 2018-02-12 20:46:02

Tags: r list similarity recommendation-engine

I have a list of events and the guests who attended them. It looks like this, but the real file is much larger:

event       guests
birthday    John Doe
birthday    Jane Doe
birthday    Mark White
wedding     John Doe
wedding     Jane Doe
wedding     Matthew Green
bar mitzvah Janet Black
bar mitzvah John Doe
bar mitzvah Jane Doe
bar mitzvah William Hill
retirement  Janet Black
retirement  Matthew Green

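I want to find the pairs of guests who attended the most events together. In this example, the answer should be John Doe and Jane Doe, because they both attended the same three events. The output should be a list of these pairs.

Where do I start?

3 answers:

Answer 0: (score: 3)

From your statement "attended the most events together," I'll assume you mean similarity measured as the intersect (the number of shared events).

You can find the intersect of events for each pair of names with the following code: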

# All names that we have
nameAll <- unique(df$guests)
# Length of names vector
N <- length(nameAll)

# Function to find intersect between names
getSimilarity <- function(nameA, nameB, type = "intersect") {
    # Subset events for name A
    eventA <- subset(df, guests == nameA)$event
    # Subset events for name B
    eventB <- subset(df, guests == nameB)$event
    # Find intersect length between events
    if (type == "intersect") {
        res <- length(intersect(eventA, eventB))
    }
    # Find Jaccard index between events
    if (type == "JC") {
        res <- length(intersect(eventA, eventB)) / length(union(eventA, eventB))
    }
    # Return result
    return(data.frame(type, value = res, nameA, nameB))
}

# Iterate over all possible combinations
# Using double loop for simpler representation    
result <- list()
for(i in 1:(N-1)) {
    for(j in (i+1):N) {
        result[[length(result) + 1]] <- getSimilarity(nameAll[i], nameAll[j])
    }
}
# Transform result to data.frame and order by similarity 
result <- do.call(rbind, result)
# Show the top 5 pairs
head(result[order(-result$value), ], 5)
       type value    nameA         nameB
1 intersect     3 John Doe      Jane Doe
2 intersect     1 John Doe    Mark White
3 intersect     1 John Doe Matthew Green
4 intersect     1 John Doe   Janet Black
5 intersect     1 John Doe  William Hill

The Jaccard index (type = "JC") puts the same pair at the top:

   type     value       nameA        nameB
1    JC 1.0000000    John Doe     Jane Doe
15   JC 0.5000000 Janet Black William Hill
2    JC 0.3333333    John Doe   Mark White
5    JC 0.3333333    John Doe William Hill
6    JC 0.3333333    Jane Doe   Mark White
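As a quick check of the Jaccard column, the value for Janet Black and William Hill can be reproduced by hand. This is only a minimal sketch: the two event vectors are typed in from the example data, and the variable names `shared`, `total`, and `jaccard` are illustrative.

```r
# Events attended by each guest, copied from the example data
eventA <- c("bar mitzvah", "retirement")  # Janet Black
eventB <- c("bar mitzvah")                # William Hill

shared  <- length(intersect(eventA, eventB))  # events both attended
total   <- length(union(eventA, eventB))      # events at least one attended
jaccard <- shared / total
jaccard
```

One shared event out of two distinct events gives 0.5, which matches row 15. Unlike the raw intersect count, the Jaccard index is lowered by events that only one of the two guests attended.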

Data (df):

structure(list(event = c("birthday", "birthday", "birthday", 
"wedding", "wedding", "wedding", "bar mitzvah", "bar mitzvah", 
"bar mitzvah", "bar mitzvah", "retirement", "retirement"), guests = c("John Doe", 
"Jane Doe", "Mark White", "John Doe", "Jane Doe", "Matthew Green", 
"Janet Black", "John Doe", "Jane Doe", "William Hill", "Janet Black", 
"Matthew Green")), .Names = c("event", "guests"), row.names = c(NA, 
-12L), class = "data.frame")
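If the double loop becomes slow on a larger file, one possible alternative (not part of the answer's code) is to enumerate the guest pairs within each event and count how often each pair occurs. A minimal base R sketch, assuming a data frame with the `event` and `guests` columns shown above; `pairs_per_event` and `pair_counts` are illustrative names.

```r
# Example data, same content as the dput() output above
df <- data.frame(
  event = c(rep("birthday", 3), rep("wedding", 3),
            rep("bar mitzvah", 4), rep("retirement", 2)),
  guests = c("John Doe", "Jane Doe", "Mark White",
             "John Doe", "Jane Doe", "Matthew Green",
             "Janet Black", "John Doe", "Jane Doe",
             "William Hill", "Janet Black", "Matthew Green"),
  stringsAsFactors = FALSE
)

# For each event, list every unordered pair of guests as "A & B"
pairs_per_event <- lapply(split(df$guests, df$event), function(g) {
  g <- sort(unique(g))                   # sorting makes each pair's label unique
  if (length(g) < 2) return(character(0))  # an event with one guest has no pairs
  combn(g, 2, FUN = paste, collapse = " & ")
})

# Count how many events each pair shares, largest first
pair_counts <- sort(table(unlist(pairs_per_event)), decreasing = TRUE)
head(pair_counts, 3)
```

This only lists pairs that share at least one event, and the work grows with the square of the number of guests per event rather than with the square of the total number of guests.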

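Answer 1: (score: 3)

A slightly different take, from a social network / matrix algebra perspective:

Your data describe links between individuals through shared membership. This is an affiliation matrix, and we can compute the matrix of connections between individuals $i$ and $j$ as follows: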

# Load as a data frame
df <- data.frame(event = c(rep("birthday", 3), 
                           rep("wedding", 3), 
                           rep("bar mitzvah", 4), 
                           rep("retirement", 2)), 
                  guests = c("John Doe", "Jane Doe", "Mark White", 
                             "John Doe", "Jane Doe", "Matthew Green",   
                              "Janet Black", "John Doe", "Jane Doe",
                              "William Hill", "Janet Black", "Matthew Green"))

# You can represent who attended which event as a matrix
M <- table(df$guests, df$event)
# Now we can compute how many times each individual appeared at an
# event with another with a simple matrix product
admat <- M %*% t(M)
admat


#               Jane Doe Janet Black John Doe Mark White Matthew Green William Hill
# Jane Doe             3           1        3          1             1            1
# Janet Black          1           2        1          0             1            1
# John Doe             3           1        3          1             1            1
# Mark White           1           0        1          1             0            0
# Matthew Green        1           1        1          0             2            0
# William Hill         1           1        1          0             0            1
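To see why the matrix product counts shared events, note that entry (i, j) of `M %*% t(M)` is the sum over events of `M[i, e] * M[j, e]`, and each term is 1 only when both guests attended event e. A small sketch that checks one entry by hand; it rebuilds the example data so it runs on its own, and `by_hand` is an illustrative name.

```r
# Example data, same content as the data frame above
df <- data.frame(
  event = c(rep("birthday", 3), rep("wedding", 3),
            rep("bar mitzvah", 4), rep("retirement", 2)),
  guests = c("John Doe", "Jane Doe", "Mark White",
             "John Doe", "Jane Doe", "Matthew Green",
             "Janet Black", "John Doe", "Jane Doe",
             "William Hill", "Janet Black", "Matthew Green")
)

# Guest-by-event 0/1 matrix (unclass() drops the "table" class)
M <- unclass(table(df$guests, df$event))
admat <- M %*% t(M)

# Entry for John Doe and Jane Doe, computed directly from the two rows
by_hand <- sum(M["John Doe", ] * M["Jane Doe", ])
by_hand
admat["John Doe", "Jane Doe"]

# tcrossprod(M) is base R shorthand for M %*% t(M)
all(tcrossprod(M) == admat)
```

Both values are 3, the number of events the two guests share.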

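Now we want to get rid of the diagonal of the matrix (which tells us how many events each person attended) and one of the two triangles of the matrix, since the two triangles contain the same information.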

diag(admat) <- 0
admat[upper.tri(admat)] <- 0

Now we just want to convert this to a format you might prefer. I'll use the melt function from the reshape2 library.

library(reshape2)
dfmatches <- unique(melt(admat))
# Drop all the zero matches
dfmatches <- dfmatches[dfmatches$value !=0,]
# order it descending
dfmatches <- dfmatches[order(-dfmatches$value),]
dfmatches

#            Var1        Var2 value
#3       John Doe    Jane Doe     3
#2    Janet Black    Jane Doe     1
#4     Mark White    Jane Doe     1
#5  Matthew Green    Jane Doe     1
#6   William Hill    Jane Doe     1
#9       John Doe Janet Black     1
#11 Matthew Green Janet Black     1
#12  William Hill Janet Black     1
#16    Mark White    John Doe     1
#17 Matthew Green    John Doe     1
#18  William Hill    John Doe     1

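Obviously, you can tidy up the output by renaming the variables of interest, and so on.

This general approach, by which I mean recognizing that your data describe a social network, may be of interest for further analysis (for example, people may be meaningfully connected if they go to lots of parties with the same people, even if not with each other). If your dataset is really large, you can make the matrix algebra a bit faster by using sparse matrices, or by loading the igraph package and using its functions to declare a social network.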
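The sparse-matrix route could look roughly like the sketch below. It assumes the Matrix package, which the answer does not name, so treat the package choice and the variable names (`guests`, `events`, `co`) as illustrative assumptions.

```r
library(Matrix)

# Example data, same content as the data frame above
df <- data.frame(
  event = c(rep("birthday", 3), rep("wedding", 3),
            rep("bar mitzvah", 4), rep("retirement", 2)),
  guests = c("John Doe", "Jane Doe", "Mark White",
             "John Doe", "Jane Doe", "Matthew Green",
             "Janet Black", "John Doe", "Jane Doe",
             "William Hill", "Janet Black", "Matthew Green")
)

guests <- factor(df$guests)
events <- factor(df$event)

# Sparse guest-by-event matrix: only the attended cells are stored
M <- sparseMatrix(i = as.integer(guests), j = as.integer(events), x = 1,
                  dimnames = list(levels(guests), levels(events)))

# Sparse equivalent of M %*% t(M): shared-event counts for every pair
co <- tcrossprod(M)
co["John Doe", "Jane Doe"]
```

With many guests and events, most cells of the attendance matrix are zero, which is where a sparse representation is likely to save memory and time.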

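Answer 2: (score: 1)

I think the answers here are great. I just want to share some thoughts. If you are working with a large dataset, with many guests or many events, many situations are possible. For example, more than two guests may all attend the same events, or two groups of guests may attend two different sets of events with the same total count. If that is the case, finding the top two guests may not be enough.

Here I'd like to demonstrate using hierarchical clustering to find similar guests or groups.

We can first build a matrix of 1s and 0s, where 1 indicates attendance and 0 indicates no attendance.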

library(tidyverse)
library(vegan)

# `dat` is the question's data frame (called `df` in the other answers)
dat_m <- dat %>%
  mutate(value = 1) %>%
  spread(event, value, fill = 0) %>%
  column_to_rownames(var = "guests") %>%
  as.matrix()

dat_m
#               bar mitzvah birthday retirement wedding
# Jane Doe                1        1          0       1
# Janet Black             1        0          1       0
# John Doe                1        1          0       1
# Mark White              0        1          0       0
# Matthew Green           0        0          1       1
# William Hill            1        0          0       0

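Then we can calculate the distance between each pair of guests. Note that I used the vegdist function from the vegan package and set binary = TRUE because we are dealing with binary data.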

dat_dist <- vegdist(dat_m, binary = TRUE)

dat_dist
#                Jane Doe Janet Black  John Doe Mark White Matthew Green
# Janet Black   0.6000000                                               
# John Doe      0.0000000   0.6000000                                   
# Mark White    0.5000000   1.0000000 0.5000000                         
# Matthew Green 0.6000000   0.5000000 0.6000000  1.0000000              
# William Hill  0.5000000   0.3333333 0.5000000  1.0000000     1.0000000
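To make these numbers less opaque: `vegdist` uses the Bray-Curtis method by default, and for presence/absence data that reduces to (A + B - 2J) / (A + B), where A and B are the numbers of events each guest attended and J is the number they share. A minimal base R sketch for one pair, with the two attendance rows typed in from `dat_m` above; the variable names are illustrative.

```r
# Attendance rows in column order: bar mitzvah, birthday, retirement, wedding
jane  <- c(1, 1, 0, 1)  # Jane Doe
janet <- c(1, 0, 1, 0)  # Janet Black

A <- sum(jane)           # events Jane attended
B <- sum(janet)          # events Janet attended
J <- sum(jane & janet)   # events both attended

d <- (A + B - 2 * J) / (A + B)
d
```

This reproduces the 0.6 in the first row of the output. A distance of 0, as for Jane Doe and John Doe, means identical attendance, and 1 means no shared events.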

Then we can perform hierarchical clustering and look at the result.

hc <- hclust(dat_dist)
plot(hc)


According to the dendrogram, Jane Doe and John Doe are the most similar, and as a group they differ the most from the other groups.
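On a larger dataset, reading groups off a dendrogram by eye gets difficult, so the tree can be cut programmatically with `cutree`. The sketch below is self-contained and, to avoid the vegan dependency, swaps in base R's `dist(method = "binary")` (a Jaccard-type distance) instead of `vegdist`. The cut height of 0.1 is an arbitrary illustrative threshold that only groups guests with near-identical attendance.

```r
# Example data, same content as the question's data frame
df <- data.frame(
  event = c(rep("birthday", 3), rep("wedding", 3),
            rep("bar mitzvah", 4), rep("retirement", 2)),
  guests = c("John Doe", "Jane Doe", "Mark White",
             "John Doe", "Jane Doe", "Matthew Green",
             "Janet Black", "John Doe", "Jane Doe",
             "William Hill", "Janet Black", "Matthew Green")
)

# Guest-by-event 0/1 matrix built in base R
m <- unclass(table(df$guests, df$event))

# Binary (Jaccard-type) distance, then complete-linkage clustering
hc <- hclust(dist(m, method = "binary"))

# Cut the tree: guests merged below height 0.1 share a group label
groups <- cutree(hc, h = 0.1)
groups

# List the members of every group that has more than one guest
split(names(groups), groups)[table(groups) > 1]
```

Raising `h` merges progressively less similar guests, which is one way to find larger groups rather than only the closest pair.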

We can also check whether Jane Doe and John Doe attended the highest number of events. They did, so we know we can pick these two.

rowSums(dat_m)
# Jane Doe   Janet Black      John Doe    Mark White Matthew Green  William Hill 
#        3             2             3             1             2             1 

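Again, I think the others' answers are more straightforward and give you the output for this example dataset, but if you are working with a larger dataset, hierarchical clustering can be an option.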