Question

I have two different data.frames, "String" and "Keywords", each with a single column as shown below. "String" has 50,000 rows and "Keywords" has 10,000 rows.

String
#I love New York  
#Live in Los Angeles  
#He stays in Yorkshire  
#Condo in Lowell  
# ...

Keywords 
#Ohio  
#Montreal  
#Los Vego  
#York  
#New York   
#Lowell    
#...

The result should be stored in a data frame with the columns "String" and "Result", looking like this:

Result  

#              String        Result  
#       I love New York    New York     
#   Live in Los Angeles          NA  
# He stays in Yorkshire        York  
#       Condo in Lowell      Lowell

The string match should be exact, but it can be case-insensitive.

Answer 1

I don't think this is the most optimal solution, but it does work:

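stringFrame <- data.frame(
  String = c("I love New York", "Live in Los Angeles",
             "He stays in Yorkshire", "Condo in Lowell"),
  stringsAsFactors = FALSE
)
wordFrame <- data.frame(
  Keywords = c("Ohio", "Montreal", "Los Vego", "York", "New York", "Lowell"),
  stringsAsFactors = FALSE
)

result <- stringFrame
result$Result <- NA_character_  # rows with no matching keyword stay NA
for (i in seq_len(nrow(result))) {
  string <- tolower(result[i, "String"])
  best <- ""  # longest keyword found so far for this row
  for (word in wordFrame$Keywords) {
    # fixed = TRUE treats the keyword as literal text rather than as a regex;
    # lower-casing both sides keeps the comparison case-insensitive
    if (grepl(tolower(word), string, fixed = TRUE) && nchar(word) > nchar(best)) {
      result[i, "Result"] <- word
      best <- word
    }
  }
}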

I saw in the title that you are looking for the longest word, so I updated my answer. Now you will always get

 String               Result
 I love New York    New York

Answer 2

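You can use the stringdist package, which implements the longest common substring ("lcs") method. The amatch function can be used to match the words against your strings: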

strings <- data.frame(string = c("I love New York", "Live in Los Angeles",
  "He stays in Yorkshire", "Condo in Lowell"), stringsAsFactors = FALSE)
words <- c("Ohio", "Montreal", "Los Vego", "York",
  "New York", "Lowell")

library(stringdist)

strings$result <- words[amatch(strings$string, words, method = "lcs", maxDist = 1E6)]

As @NickK commented, this matches "Live in Los Angeles" with "Los Vego". To filter out such partial matches, you can do:

# filter out partial matches
match <- nchar(strings$string) - nchar(strings$result)  ==
  stringdist(strings$result, strings$string, method="lcs")
strings$result[!match] <- NA

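This solution seems to be slightly slower than @NickK's. Using his example data set, the solution above takes 486 seconds on my system, while his takes 416 seconds.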
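To see why the filter in this answer works: the "lcs" distance counts only insertions and deletions, so it equals the difference in length exactly when every character of the keyword can be found, in order, inside the string. The sketch below illustrates this with base R's adist() as a stand-in for stringdist (an assumption made so the example runs without extra packages); indel_dist is a hypothetical helper name, not part of either package.

```r
# Insertion/deletion-only edit distance via base R's adist().
# Pricing a substitution at 2 makes it no cheaper than a delete plus an
# insert, so the result is a pure insert/delete distance.
# ASSUMPTION: this stands in for stringdist(..., method = "lcs").
indel_dist <- function(a, b) {
  drop(utils::adist(a, b,
                    costs = list(insertions = 1, deletions = 1, substitutions = 2)))
}

s1 <- "He stays in Yorkshire"
s2 <- "Live in Los Angeles"

# "York" sits inside s1, so the distance is exactly the length difference:
# the filter keeps this match.
kept <- nchar(s1) - nchar("York") == indel_dist("York", s1)

# "Los Vego" cannot be found in order inside s2, so the distance exceeds the
# length difference: the filter sets this result to NA.
dropped <- nchar(s2) - nchar("Los Vego") != indel_dist("Los Vego", s2)

print(c(kept = kept, dropped = dropped))
```

Note that the comparison is case-sensitive as written, so lower-casing both sides first would be needed for a case-insensitive check.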

Answer 3

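This should be much faster than the nested loops shown so far. On my machine, without any parallelisation, it gets through 100,000 strings and 50,000 words/substrings in about 12.5 minutes.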

Sample data:

library("data.table")
# Downloaded from https://raw.githubusercontent.com/datasets/airport-codes/master/data/airport-codes.csv
airports <- fread("airport-codes.csv")
first_bit <- paste(c("Lives", "Works", "Plays", "Condo", "Apartment", "I love"), "in")

places <- unique(c(airports$name, airports[!municipality == "", municipality]))

set.seed(123)
strings <- data.table(
  string = paste(sample(first_bit, 1e5, TRUE),
                 sample(places, 1e5, TRUE))
)
words <- sample(places, 5e4)

The actual routine, based on grepl:

system.time({
  strings[, `:=`(lower = tolower(string), result = NA_character_)]
  words <- words[order(nchar(words), words, decreasing = TRUE)]
  i <- 0
  for (x in words) {
    i <- i + 1
    if (i %% 100 == 0) cat(i, "\n")
    found <- grepl(tolower(x), strings$lower, fixed = TRUE)
    strings[found & is.na(result), result := x]
  }
  strings[, lower := NULL]
})

Note that on Windows, fread and its ilk work out of the box on https links, but on Linux you need to use download.file with the appropriate curl or wget option.

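Edit: The OP has now indicated that he only wants whole-word matches. This can be achieved using non-fixed matching and the \b syntax in the regular expression. However, this is also an opportunity to do the whole thing much faster.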
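As a minimal sketch of the \b approach mentioned above: wrapping a keyword in word-boundary anchors makes grepl match it only as a whole word. The helper name has_whole_word is hypothetical, and the sketch assumes the keyword contains no regex metacharacters (they would need escaping first).

```r
# Hypothetical helper: TRUE where `keyword` occurs in `x` as a whole word.
has_whole_word <- function(keyword, x) {
  # ASSUMPTION: `keyword` has no regex metacharacters such as "." or "(".
  pattern <- paste0("\\b", keyword, "\\b")
  grepl(pattern, x, ignore.case = TRUE, perl = TRUE)
}

hits <- has_whole_word("York", c("I love New York", "He stays in Yorkshire"))
print(hits)  # TRUE for the first string, FALSE for "Yorkshire"
```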

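Here is a proposed algorithm that runs in under a minute on my machine. It splits each string into words at space boundaries (after first collapsing runs of consecutive spaces into one). It then computes the length of every possible substring made up of whole words. The searched-for keywords are then split by length, and match can be used to find exact matches between substrings and keywords. Because the keywords are processed from longest to shortest, it will always use the longest keyword available.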
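The idea described above can be sketched in base R as follows. This is an illustrative reconstruction under stated assumptions, not the answerer's code: it loops over strings one at a time rather than working in bulk, so it will not reach the quoted timing, and the function name longest_keyword_match is hypothetical.

```r
# Sketch: for each string, return the longest keyword that matches a run of
# whole words (case-insensitive), or NA when no keyword matches.
longest_keyword_match <- function(strings, keywords) {
  # Lower-case, collapse runs of spaces into one, and trim the ends.
  normalise <- function(x) trimws(gsub(" +", " ", tolower(x)))
  keys <- normalise(keywords)

  vapply(strings, function(s) {
    w <- strsplit(normalise(s), " ", fixed = TRUE)[[1]]
    n <- length(w)
    if (n == 0) return(NA_character_)

    # Every contiguous run of whole words, as (start, end) index pairs.
    idx <- expand.grid(start = seq_len(n), end = seq_len(n))
    idx <- idx[idx$start <= idx$end, ]
    cand <- mapply(function(a, b) paste(w[a:b], collapse = " "),
                   idx$start, idx$end)

    # Longest candidates first, so the first hit is the longest keyword.
    cand <- cand[order(nchar(cand), decreasing = TRUE)]
    hit <- match(cand, keys)  # exact matches only
    hit <- hit[!is.na(hit)]
    if (length(hit) == 0) NA_character_ else keywords[hit[1]]
  }, character(1), USE.NAMES = FALSE)
}

strings  <- c("I love New York", "Live in Los Angeles",
              "He stays in Yorkshire", "Condo in Lowell")
keywords <- c("Ohio", "Montreal", "Los Vego", "York", "New York", "Lowell")
res <- longest_keyword_match(strings, keywords)
print(data.frame(String = strings, Result = res))
```

With whole-word matching, "He stays in Yorkshire" no longer matches "York", which differs from the substring-based result shown in the question.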


Longest substring match across two different data frames

3 answers: