Fuzzy matching university names / joining two data frames

Time: 2018-10-30 19:53:47

Tags: r merge text-mining fuzzy fuzzyjoin

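I have a list of hand-entered university names that contains typos and inconsistencies. I need to match them against an official list of university names so that I can link my data together.

I know fuzzy matching/joining is the way to go, but I'm a bit lost on the right approach. Any help would be greatly appreciated.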

d <- data.frame(name = c("University of New Yorkk", "The University of South Carolina",
                         "Syracuuse University", "University of South Texas",
                         "The University of No Carolina"),
                score = c(1, 3, 6, 10, 4))

y <- data.frame(name = c("University of South Texas", "The University of North Carolina",
                         "University of South Carolina", "Syracuse University",
                         "University of New York"),
                distance = c(100, 400, 200, 20, 70))

I would like the output to merge them as closely as possible:

matched<-data.frame(name=c("University of New Yorkk", "The University of South Carolina", 
"Syracuuse University","University of South Texas","The University of No Carolina"), 
correctmatch = c("University of New York", "University of South Carolina", 
"Syracuse University","University of South Texas", "The University of North Carolina"))

1 Answer:

Answer 0 (score: 1)

I use adist() for this kind of thing, with a small wrapper function called closest_match() that compares a value against a set of "good/allowed" values.

library(magrittr) # for the %>%

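closest_match <- function(bad_value, good_values) {
  # Edit distance from bad_value to every allowed value, named by that value
  distances <- adist(bad_value, good_values, ignore.case = TRUE) %>%
    as.numeric() %>%
    setNames(good_values)

  # Keep the allowed value(s) with the smallest distance; ties return more than one
  distances[distances == min(distances)] %>%
    names()
}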

sapply(d$name, function(x) closest_match(x, y$name)) %>%
  setNames(d$name)

University of New Yorkk The University of South\n Carolina               Syracuuse University 
"University of New York"     "University of South Carolina"           "University of New York" 
University of South Texas      The University of No Carolina 
"University of South Texas"     "University of South Carolina" 

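To go from that named vector to the merged data frame the question asks for, something along these lines might work. This is only a sketch and not part of the answer above: normalize_name() is a hypothetical helper, and dropping a leading "The" and ignoring case are assumptions about which differences should not count as edits. The sample data is the question's, with each name written on one line.

```r
# Sample data from the question, each name on a single line.
d <- data.frame(
  name = c("University of New Yorkk", "The University of South Carolina",
           "Syracuuse University", "University of South Texas",
           "The University of No Carolina"),
  score = c(1, 3, 6, 10, 4),
  stringsAsFactors = FALSE
)
y <- data.frame(
  name = c("University of South Texas", "The University of North Carolina",
           "University of South Carolina", "Syracuse University",
           "University of New York"),
  distance = c(100, 400, 200, 20, 70),
  stringsAsFactors = FALSE
)

# Hypothetical helper (an assumption, not from the answer):
# lowercase, trim whitespace, and drop a leading "the ".
normalize_name <- function(x) {
  sub("^the[[:space:]]+", "", tolower(trimws(x)))
}

# Rows are the typed names, columns are the official names.
dist_matrix <- adist(normalize_name(d$name), normalize_name(y$name))

# Column index of the closest official name per typed name (first wins on ties).
best <- apply(dist_matrix, 1, which.min)

matched <- data.frame(
  name = d$name,
  correctmatch = y$name[best],
  edits = dist_matrix[cbind(seq_along(best), best)],
  score = d$score,
  distance = y$distance[best],
  stringsAsFactors = FALSE
)
matched
```

On real data you would probably also want a cutoff on the edits column, so that a name with no reasonable counterpart is left unmatched instead of being forced onto its nearest official name.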
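adist() computes the generalized Levenshtein (edit) distance between two strings: the minimal number of insertions, deletions and substitutions needed to turn one string into the other.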
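A quick illustration of what those edit counts look like, using names taken from the question rather than anything in the answer's own code. The variable names are just for the example.

```r
# One extra "u" in the typed name: a single deletion fixes it.
typo_distance <- adist("Syracuuse University", "Syracuse University")
typo_distance[1, 1]
# 1

# Distance from one typed name to two official names at once.
typed <- "The University of South Carolina"
official <- c("The University of North Carolina", "University of South Carolina")
distances <- adist(typed, official)
distances[1, ]
# 2 4
```

The second result is worth noting: on raw edit distance the leading "The " costs four deletions, so the South Carolina entry is actually closer to "The University of North Carolina" (two substitutions) than to its intended match. Cleaning up prefixes like that before comparing can matter as much as the distance measure itself.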