在加入Dplyr之前比较两个数据帧之间的名称列

时间:2016-06-09 21:15:18

标签: r dplyr tidyr stringr

我想知道在dplyr中进行连接之前是否可以轻松比较列。以下是两个简单的数据帧。我想根据名字和姓氏进行完全加入,但是有一些拼写错误或不同的格式,例如" Elizabeth Ray" vs"伊丽莎白"。

我想在加入之前比较这些列。我希望有一种方法可以产生所有与索引差异的列表或向量,这样我就可以在加入之前纠正它们。

如果有更简单的方法,我也会对此持开放态度,但我希望能采用最简单的方法。我想要一个基于dplyr,tidyr和stringr的解决方案。

FirstNames<-c("Chris","Doug","Shintaro","Bubbles","Elsa")
LastNames<-c("MacDougall","Shapiro","Yamazaki","Murphy","Elizabeth Ray")
Pets<-c("Cat","Dog","Cat","Dog","Cat")
Names1<-data.frame(FirstNames,LastNames,Pets)

FirstNames2<-c("Chris","Doug","Shintaro","Bubbles","Elsa")
LastNames2<-c("MacDougal","Shapiro","Yamazaku","Murphy","Elizabeth")
Dwelling<-c("House","House","Apartment","Condo","House")
Names2<-data.frame(FirstNames2,LastNames2,Dwelling)

3 个答案:

答案 0 :(得分:0)

我正在回答,因为我无法访问评论

df = Names1[!(Names1$LastNames %in% Names2$LastNames2), ]

尝试一下代码。

答案 1 :(得分:0)

为了比较您的记录之间的相似性,我认为您可能正在寻找一种方法来将一定量的模糊逻辑匹配应用于您的名称比较任务。 AKA:应用String Distance Function 执行Record Linkage任务时。 (如果你已经知道这一切,请原谅我 - 但这些关键词在开始时对我有很大的帮助。)

有一个名为stringdist的优秀软件包适用于这些应用程序,但recordlinkage可能会帮助您最快地完成对齐数据框架的工作。

如果您希望查看姓氏和姓氏最相似的值,请使用以下代码:

library(RecordLinkage)
library(dplyr)

id <- c(1:5) # added in to allow joining of data tables & comparison results
firstName <- c("Chris","Doug","Shintaro","Bubbles","Elsa")
lastName <- c("MacDougall","Shapiro","Yamazaki","Murphy","Elizabeth Ray")
pet <- c("Cat","Dog","Cat","Dog","Cat")
Names1 <- data.frame(id, firstName, lastName, pet)

id <- c(1:5) # added in to allow joining of data tables & comparison results
firstName2 <- c("Chris","Doug","Shintaro","Bubbles","Elsa")
lastName2 <- c("MacDougal","Shapiro","Yamazaku","Murphy","Elizabeth")
dwelling <- c("House","House","Apartment","Condo","House")
Names2 <- data.frame(id, firstName2, lastName2, dwelling)

# RecordLinkage function that calculates string distance b/w records in two data frames
Results <- compare.linkage(Names1, Names2, blockfld = 1, strcmp = T, exclude = 4)
Results
#  $data1
#    firstName      lastName  pet
# 1      Chris    MacDougall  Cat
# 2       Doug       Shapiro  Dog
# 3   Shintaro      Yamazaki  Cat
# 4    Bubbles        Murphy  Dog
# 5       Elsa Elizabeth Ray  Cat

# $data2
#    firstName2  lastName2  dwelling
# 1       Chris  MacDougal     House
# 2        Doug    Shapiro     House
# 3    Shintaro   Yamazaku Apartment
# 4     Bubbles     Murphy     Condo
# 5        Elsa  Elizabeth     House

# $pairs
# id1 id2 id firstName  lastName is_match
# 1   1   1  1         1 0.9800000       NA
# 2   2   2  1         1 1.0000000       NA
# 3   3   3  1         1 0.9500000       NA
# 4   4   4  1         1 1.0000000       NA
# 5   5   5  1         1 0.9384615       NA

# $frequencies
# id firstName  lastName 
# 0.200     0.200     0.125 
# $type
# [1] "linkage"

# attr(,"class")
# [1] "RecLinkData"

# Trim $pairs dataframe (seen above) to contain just id's & similarity measures
PairsSelect <- 
    Results$pairs %>% 
    select(id = id1, firstNameSim = firstName, lastNameSim = lastName)

# Join original data & string comparison results together
# reorganize data to facilitate review
JoinedResults <-
    left_join(Names1, Names2) %>% 
    left_join(PairsSelect) %>% 
    select(id, firstNameSim, firstName, firstName2, lastNameSim, lastName, lastName2) %>% 
    arrange(desc(lastNameSim), desc(firstNameSim), id)
JoinedResults
# id firstNameSim firstName firstName2 lastNameSim      lastName lastName2
# 1  2            1      Doug       Doug   1.0000000       Shapiro   Shapiro
# 2  4            1   Bubbles    Bubbles   1.0000000        Murphy    Murphy
# 3  1            1     Chris      Chris   0.9800000    MacDougall MacDougal
# 4  3            1  Shintaro   Shintaro   0.9500000      Yamazaki  Yamazaku
# 5  5            1      Elsa       Elsa   0.9384615 Elizabeth Ray Elizabeth

# If you want to collect just the perfect matches
PerfectMatches <- 
    JoinedResults %>% 
    filter(firstNameSim == 1 & lastNameSim == 1) %>% 
    select(id, firstName, lastName)
PerfectMatches
#   id firstName lastName
# 1  2      Doug  Shapiro
# 2  4   Bubbles   Murphy

# To collect the matches that are going to need alignment
ImperfectMatches <- 
    JoinedResults %>% 
    filter(firstNameSim < 1 | lastNameSim < 1) %>% 
    mutate(flgFrstNm = 0, flgLstNm = 0)
ImperfectMatches
#   id firstNameSim firstName firstName2 lastNameSim      lastName lastName2 flgFrstNm flgLstNm
# 1  1            1     Chris      Chris   0.9800000    MacDougall MacDougal         0        0
# 2  3            1  Shintaro   Shintaro   0.9500000      Yamazaki  Yamazaku         0        0
# 3  5            1      Elsa       Elsa   0.9384615 Elizabeth Ray Elizabeth         0        0
# 

# If you want to enter your column preference in a flag column to facilitate faster rectification...
write.csv(ImperfectMatches, "ImperfectMatches.csv", na = "", row.names = F)
## ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## ##
# Flag data externally - save file to new name with '_reviewed' appended to filename
## ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## ##
#reload results
FlaggedMatches <- read.csv("ImperfectMatches_reviewed.csv", stringsAsFactors = F)
FlaggedMatches
## Where a 1 is the 1st data set preferred and 0 (or 2 if that is easier for the 'data processor') means the 2nd data set is preferred.
#   id firstNameSim firstName firstName2 lastNameSim      lastName lastName2 flgFrstNm flgLstNm
# 1  1            1     Chris      Chris   0.9800000    MacDougall MacDougal         1        0
# 2  3            1  Shintaro   Shintaro   0.9500000      Yamazaki  Yamazaku         1        1
# 3  5            1      Elsa       Elsa   0.9384615 Elizabeth Ray Elizabeth         1        0

## Executing Assembly of preferred/rectified firstName and lastName columns
ResolvedMatches <- 
    FlaggedMatches %>% 
    mutate(rectifiedFirstName = ifelse(flgFrstNm == 1,firstName, firstName2),
           rectifiedLastName = ifelse(flgLstNm == 1, lastName, lastName2)) %>% 
    select(id, starts_with("rectified"))

ResolvedMatches
# id rectifiedFirstName rectifiedLastName
# 1  1              Chris         MacDougal
# 2  3           Shintaro          Yamazaki
# 3  5               Elsa         Elizabeth

dplyr非常直观,但compare.linkage()函数可以使用一些解释。

前两个参数很明显:您要比较的两个数据帧(dataframe1和dataframe2)。 [如果您只想将onedataframe内的记录与自身进行比较(以重复记录记录集),那么您可以改为使用compare.dedup(),并仅引用一个数据帧。

在这种情况下,将blockfld设置为1或2将分别指定First Name或Last Name上的匹配项必须为100%。相反,您可能希望在数据集中包含主/外键并在blckfld参数中引用该列。或者,如果您的记录实际上并非如此等效构建,则可以完全保留此参数(默认为FALSE),然后将比较所有可能的组合[数据帧的叉积]。

strcmpTRUE会为您提供应用于您要比较的数据列的字符串距离函数;如果你把它保留为假,那么它只是测试精确的1:1字符串对应。

exclude也是避免构建中间数据框并仅选择您希望彼此比较的列的好方法:排除3只是允许我们删除宠物和住宅的比较结果。

上面代码中的4列,键控数据帧(不是原始问题3列数据帧)产生的结果如下:

#  $data1
#    firstName      lastName  pet
# 1      Chris    MacDougall  Cat
# 2       Doug       Shapiro  Dog
# 3   Shintaro      Yamazaki  Cat
# 4    Bubbles        Murphy  Dog
# 5       Elsa Elizabeth Ray  Cat

# $data2
#    firstName2  lastName2  dwelling
# 1       Chris  MacDougal     House
# 2        Doug    Shapiro     House
# 3    Shintaro   Yamazaku Apartment
# 4     Bubbles     Murphy     Condo
# 5        Elsa  Elizabeth     House

# $pairs
# id1 id2 id firstName  lastName is_match
# 1   1   1  1         1 0.9800000       NA
# 2   2   2  1         1 1.0000000       NA
# 3   3   3  1         1 0.9500000       NA
# 4   4   4  1         1 1.0000000       NA
# 5   5   5  1         1 0.9384615       NA

# $frequencies
# id firstName  lastName 
# 0.200     0.200     0.125 
# $type
# [1] "linkage"

# attr(,"class")
# [1] "RecLinkData"

上面的每个部分(例如$ pair)都是它自己的数据框。 添加一个键,您可以将它们全部连接在一起然后引用并使用成对的df作为切换级别门,然后甚至将data1值复制到data2帧中,例如当您有一个&gt;配对评级中的0.95值。 (注意:is_match看起来很重要,但它适用于训练匹配工具,与我们的任务无关。)

无论如何,我希望你发现这些图书馆能够突然增加你的工作量,就像我第一次遇到它时一样令人头疼。

BTW:我还发现这个Comparison of String Distance Algorithms是对目前可用的字符串距离指标的一个很好的调查。

答案 2 :(得分:0)

使用标准adist()函数作为@alistaire建议提供了一种非常有效的方法(并且可能更有可能是教师寻求使用的。)adist的字符串度量受到限制广义的Levenshtein(编辑)距离,但这看起来正是你正在寻找的。

代码如下: (因为这看起来像是专门用于数据处理的R编码类的介绍,我在一些最佳实践中添加了抛光到可重现/提出的问题。)

library(dplyr)

id <- c(1:5)
firstName <- c("Chris","Doug","Shintaro","Bubbles","Elsa")
lastName <- c("MacDougall","Shapiro","Yamazaki","Murphy","Elizabeth Ray")
pet <- c("Cat","Dog","Cat","Dog","Cat")
Names1 <- data.frame(id, firstName, lastName, pet)

id <- c(1:5)
firstName2 <- c("Chris","Doug","Shintaro","Bubbles","Elsa")
lastName2 <- c("MacDougal","Shapiro","Yamazaku","Murphy","Elizabeth")
dwelling <- c("House","House","Apartment","Condo","House")
Names2 <- data.frame(id, firstName2, lastName2, dwelling)
# NB: technically you could merge these data frames later with `bind_cols()` but best 
# datahandling practices dictate joining/comparing data based on keys (instead of 
# binding columns together based upon the order in which tables are initially arranged.)
#[also preference is for column headers to be singular and lower case, and tables/dataframes to be uppercase and plural - from (or extension from principles in): https://google.github.io/styleguide/Rguide.xml]

## adist() calculates string distance b/w records in two data frames
# Matrix between all columns is great way to ascertain similarity of data
# on overall column to column basis.
# 0 is closest resemblance, higher numbers are lowest resemblance
ResultsInterColumnComparison <-
    adist(Names1, Names2, partial = T)
ResultsInterColumnComparison

# firstName to firstName2 & lastName to LastName2 are similar columns.
#           id firstName2 lastName2 dwelling
# id         0          2         2        2
# firstName 15          0         3        4
# lastName  15          3         0        5
# pet       15          5         4        3

# adist column to column DifferenceCount (using dplyr)
dltFrstN <- diag(adist(Names1$firstName, Names2$firstName2, partial = T))
dltLstN <- diag(adist(Names1$lastName, Names2$lastName2, partial = T))

# Join all info together
DFcompilation <- 
    data.frame(id, dltFrstN, firstName, firstName2, dltLstN, lastName, lastName2) %>% 
    arrange(desc(dltLstN), desc(dltLstN))
DFcompilation
#   id dltFrstN firstName firstName2 dltLstN      lastName lastName2
# 1  5        0      Elsa       Elsa       4 Elizabeth Ray Elizabeth
# 2  1        0     Chris      Chris       1    MacDougall MacDougal
# 3  3        0  Shintaro   Shintaro       1      Yamazaki  Yamazaku
# 4  2        0      Doug       Doug       0       Shapiro   Shapiro
# 5  4        0   Bubbles    Bubbles       0        Murphy    Murphy

这种方法更简单,所需的编码更简洁。我希望这对你的目的更有帮助。