Matching two data frames with aregexec

Asked: 2015-08-24 13:47:38

Tags: regex r replace vectorization

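I have two data frames. One is the target data frame (targetframe, spelled tragetframe in the code below); the other serves as a library of key values (word.library). I need the following algorithm: it should look for approximate matches between word.library$mainword and targetframe$words. Once an approximate match has been found, the matched substring in targetframe$words should be replaced with the corresponding word.library$keyID.

Here are the two data frames: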

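# stringsAsFactors = FALSE keeps the text as character vectors
# (this is the default only since R 4.0.0)
tragetframe <- data.frame(words = c("This is sentence one with the important word",
                                    "This is sentence two with the inportant woord",
                                    "This is sentence three with crazy sayings"),
                          stringsAsFactors = FALSE)

word.library <- data.frame(mainword = c("important word",
                                        "crazy sayings"),
                           keyID = c("1001",
                                     "2001"),
                           stringsAsFactors = FALSE)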

This is my solution:

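for (i in seq_len(nrow(word.library))) {
  # Find the approximate match of the i-th library phrase in every sentence
  positions <- aregexec(word.library[i, 1], tragetframe$words, max.distance = 0.1)
  res <- regmatches(tragetframe$words, positions)
  res[lengths(res) == 0] <- "XXXX"  # placeholder for sentences without a match
  # Replace the matched substring (taken literally, hence fixed = TRUE) with the keyID
  tragetframe$words <- Vectorize(gsub)(unlist(res), word.library[i, 2],
                                       tragetframe$words, fixed = TRUE)
}
tragetframe$words  # inspect the result once the loop has finished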
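
As a side note on why max.distance = 0.1 is enough for the misspelled second sentence, here is a minimal sketch. It assumes the default unit costs for insertions, deletions and substitutions; the variable names (pattern, sentences, allowed_edits, typo_distance, has_match) are made up for illustration and are not part of the code above.

```r
# Illustrative copies of the library phrase and the three sentences.
pattern <- "important word"
sentences <- c("This is sentence one with the important word",
               "This is sentence two with the inportant woord",
               "This is sentence three with crazy sayings")

# Per ?agrep, a fractional max.distance is multiplied by the pattern length
# and rounded up: 0.1 * 14 characters = 1.4, i.e. up to 2 edits.
allowed_edits <- ceiling(0.1 * nchar(pattern))

# Edit distance between the library phrase and the misspelled phrase:
# one substitution (m -> n) plus one insertion (the extra o).
typo_distance <- drop(utils::adist(pattern, "inportant woord"))

# agrepl() accepts the same max.distance argument and returns one logical
# per sentence, showing which rows contain an approximate match at all.
has_match <- agrepl(pattern, sentences, max.distance = 0.1)

print(allowed_edits)
print(typo_distance)
print(has_match)
```

Under these assumptions the tolerated number of edits and the distance of the typo should both come out as 2, which is why the second sentence still counts as a match while the third one does not.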

However, I am using a for loop, which is very inefficient (suppose I have two huge data frames). Does anyone know how to solve this more efficiently?

0 answers:

No answers yet.