Question

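I have a list of phrases in which I want to replace certain words with a similar word, in case they are misspelled.

How can I search a string, match one of its words, and replace it?

The expected input looks like this: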

a1<- c(" the classroom is ful ")
a2<- c(" full")

In this case, I want to replace ful with full in a1.

Answer 1

I think the function you are looking for is gsub():

gsub(pattern = "ful", replacement = a2, x = a1)
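
One detail worth checking in the call above: a2 as defined in the question is " full", with a leading space, so the replacement carries that space into the result. A minimal base R sketch of the difference, assuming the a1 and a2 values from the question (the names as_posted and trimmed are only illustrative):

```r
a1 <- " the classroom is ful "
a2 <- " full"

# Replacement used exactly as posted: the leading space in a2 is inserted too.
as_posted <- gsub(pattern = "ful", replacement = a2, x = a1)

# Stripping surrounding whitespace from the replacement first.
trimmed <- gsub(pattern = "ful", replacement = trimws(a2), x = a1)

print(as_posted)
print(trimmed)
```

Whether the extra space matters depends on how the phrases are used afterwards; trimws() is just one way to normalise the replacement.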

Answer 2

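Take a look at the hunspell package. As the comments have already suggested, your problem is much harder than it looks, unless you already have a dictionary of misspelled words and their correct spellings.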

library(hunspell)
a1 <- c(" the classroom is ful ")
bads <- hunspell(a1)
bads
# [[1]]
# [1] "ful"
hunspell_suggest(bads[[1]])
# [[1]]
#  [1] "fool" "flu"  "fl"   "fuel" "furl" "foul" "full" "fun"  "fur"  "fut"  "fol"  "fug"  "fum"

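So even in your example, do you want to replace ful with full, or with one of the many other options here?

The package allows you to use your own dictionary. Let's say you are doing that, or at least that you are happy with the first suggestion returned.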

library(stringr)
str_replace_all(a1, bads[[1]], hunspell_suggest(bads[[1]])[[1]][1])
# [1] " the classroom is fool "

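However, as the other comments and answers have pointed out, you need to be careful about misspelled words that also appear inside other words.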

a3 <- c(" the thankful classroom is ful ")
str_replace_all(a3, 
                paste("\\b", 
                      hunspell(a3)[[1]], 
                      "\\b", 
                      collapse = "", sep = ""), 
                hunspell_suggest(hunspell(a3)[[1]])[[1]][1])
# [1] " the thankful classroom is fool "

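Update

Based on your comment, you already have a dictionary, structured as one vector of bad words and another vector of their replacements.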

library(stringr)
a4 <- "I would like a cheseburger and friees please"
badwords.corpus <- c("cheseburger", "friees")
goodwords.corpus <- c("cheeseburger", "fries")

vect.corpus <- goodwords.corpus
names(vect.corpus) <- badwords.corpus

str_replace_all(a4, vect.corpus)
# [1] "I would like a cheeseburger and fries please"
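
If stringr is not available, the same dictionary lookup can be approximated in base R by looping over the named vector and calling gsub() once per entry. This is only a sketch under that assumption, not what str_replace_all does internally; the variable name fixed is illustrative.

```r
a4 <- "I would like a cheseburger and friees please"

# Same structure as vect.corpus above: names are bad words, values are fixes.
vect.corpus <- c(cheseburger = "cheeseburger", friees = "fries")

fixed <- a4
for (bad in names(vect.corpus)) {
  # Each pass replaces every occurrence of one bad word.
  fixed <- gsub(bad, vect.corpus[[bad]], fixed)
}
print(fixed)
```

Because the replacements run one after another, an earlier replacement can in principle create text that a later pattern matches, so the order of the dictionary may matter for larger lists.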

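Update 2

Addressing your comment with your new example brings us back to the problem of words occurring inside other words. The solution is to use \\b, which represents a word boundary. The pattern "thin" will match "thin", "think", "thinking", and so on. But if you wrap it in \\b, the pattern is anchored to word boundaries: \\bthin\\b will only match "thin".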
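
The effect of the word-boundary anchors can be checked directly with base R's grepl(). A small sketch, where the vector words and the result names are purely illustrative:

```r
words <- c("thin", "think", "thinking")

# Without anchors, "thin" matches anywhere inside each string.
unanchored <- grepl("thin", words)

# With \\b on both sides, only the standalone word matches.
anchored <- grepl("\\bthin\\b", words)

print(unanchored)
print(anchored)
```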

Your example:

a <- c(" thin, thic, thi") 
badwords.corpus <- c("thin", "thic", "thi" ) 
goodwords.corpus <- c("think", "thick", "this")

The solution is to modify badwords.corpus:

badwords.corpus <- paste("\\b", badwords.corpus, "\\b", sep = "")
badwords.corpus
# [1] "\\bthin\\b" "\\bthic\\b" "\\bthi\\b"

Then create vect.corpus as I described in the previous update and use it in str_replace_all.

vect.corpus <- goodwords.corpus
names(vect.corpus) <- badwords.corpus

str_replace_all(a, vect.corpus)
# [1] " think, thick, this"

Answer 3

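Create a list of the corrections, then replace them using gsubfn, which is a generalization of gsub that can also take list, function and proto object replacement objects.

library(gsubfn)
L <- list(ful = "full") # can add more words to this list if desired
gsubfn("\\b\\w+\\b", L, a1, perl = TRUE)
## [1] " the classroom is full "

The regular expression matches a word boundary, one or more word characters, and another word boundary. Each time it finds a match, it looks the match up in the list names and, if found, replaces it with the corresponding list value.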
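
For readers who want to see the lookup step spelled out, roughly the same idea can be sketched in base R with gregexpr() and regmatches(). This is an approximation of the behaviour described above, not gsubfn's implementation, and the name corrections is illustrative.

```r
a1 <- " the classroom is ful "
corrections <- c(ful = "full")  # names are misspellings, values are fixes

# Locate every whole word in the string.
m <- gregexpr("\\b\\w+\\b", a1, perl = TRUE)
found <- regmatches(a1, m)

# Swap in the correction for each word that appears in the lookup table;
# words that are not listed are written back unchanged.
regmatches(a1, m) <- lapply(found, function(w) {
  hit <- w %in% names(corrections)
  w[hit] <- unname(corrections[w[hit]])
  w
})
print(a1)
```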


Answer 4

For an ordered replacement, you can try this:

a1 <- c("the classroome is ful")
# ordered replacement
badwords.corpus <- c("ful", "classroome")
goodwords.corpus <- c("full", "classroom")

qdap::mgsub(badwords.corpus, goodwords.corpus, a1) # or
stringi::stri_replace_all_fixed(a1, badwords.corpus, goodwords.corpus, vectorize_all = FALSE)

For an unordered replacement, you can use approximate string matching (see stringdist::amatch). Here is an example:

a1 <- c("the classroome is ful")
a1
# [1] "the classroome is ful"

library(stringdist)
goodwords.corpus <- c("full", "classroom")
badwords.corpus <- unlist(strsplit(a1, " ")) # extract words
for (badword in badwords.corpus){
  patt <- paste0('\\b', badword, '\\b')
  repl <- goodwords.corpus[amatch(badword, goodwords.corpus, maxDist = 1)] # you can change the distance see ?amatch
  final.word <- ifelse(is.na(repl), badword, repl)
  a1 <- gsub(patt, final.word, a1)
}
a1
# [1] "the classroom is full"
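
A similar approximate match can be sketched without extra packages using adist() from base R's utils package, which computes the generalized Levenshtein distance. This swaps in a different distance function than amatch's default, so results may differ on other inputs; the threshold of 1 mirrors maxDist = 1 above, and the names phrase, tokens, and corrected are illustrative.

```r
phrase <- "the classroome is ful"
goodwords.corpus <- c("full", "classroom")

# Split into words and compute the edit distance to every good word.
tokens <- strsplit(phrase, " ", fixed = TRUE)[[1]]
d <- utils::adist(tokens, goodwords.corpus)  # rows: tokens, cols: good words

# For each token, pick the nearest good word and accept it only if the
# distance is at most 1; otherwise keep the token unchanged.
best <- apply(d, 1, which.min)
close_enough <- d[cbind(seq_along(tokens), best)] <= 1
tokens[close_enough] <- goodwords.corpus[best[close_enough]]

corrected <- paste(tokens, collapse = " ")
print(corrected)
```

Splitting on single spaces assumes the phrase has no punctuation or repeated whitespace; real text would need a more careful tokenizer.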
