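Replacing strings

Time: 2017-12-04 19:49:42

Tags: r string replace text-mining

I have a list of phrases, and I want to replace certain words with a similar word in case they are misspelled.

How can I search a string, match a word, and replace it?

An example of the expected result: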

a1<- c(" the classroom is ful ")
a2<- c(" full")

In this case, I would replace ful with full in a1.

4 answers:

Answer 0: (score: 1)

I think the function you are looking for is gsub():

gsub (pattern = "ful", replacement = a2, x = a1)
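One thing to watch in the call above: a2 as defined in the question carries a leading space, so the replacement inserts an extra blank. A minimal base R sketch, assuming it is acceptable to trim the replacement with trimws() (the variable names raw and trimmed are only illustrative):

```r
a1 <- c(" the classroom is ful ")
a2 <- c(" full")

# Replacement used as-is: the leading space in a2 is copied into the result.
raw <- gsub(pattern = "ful", replacement = a2, x = a1)
raw
# [1] " the classroom is  full "

# Replacement trimmed first: spacing of the original sentence is preserved.
trimmed <- gsub(pattern = "ful", replacement = trimws(a2), x = a1)
trimmed
# [1] " the classroom is full "
```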

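Answer 1: (score: 1)

Take a look at the hunspell package. As the comments have already suggested, your problem is much harder than it looks unless you already have a dictionary of misspelled words and their correct spellings.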

library(hunspell)
a1 <- c(" the classroom is ful ")
bads <- hunspell(a1)
bads
# [[1]]
# [1] "ful"
hunspell_suggest(bads[[1]])
# [[1]]
#  [1] "fool" "flu"  "fl"   "fuel" "furl" "foul" "full" "fun"  "fur"  "fut"  "fol"  "fug"  "fum" 

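So even in your example, do you want to replace ful with full, or with one of the many other options here?

The package does let you use your own dictionary. Let's say you are doing that, or at least that you are happy with the first suggestion returned.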

library(stringr)
str_replace_all(a1, bads[[1]], hunspell_suggest(bads[[1]])[[1]][1])
# [1] " the classroom is fool "

However, as other comments and answers have pointed out, you need to be careful about the misspelled word appearing inside other words.

a3 <- c(" the thankful classroom is ful ")
str_replace_all(a3, 
                paste("\\b", 
                      hunspell(a3)[[1]], 
                      "\\b", 
                      collapse = "", sep = ""), 
                hunspell_suggest(hunspell(a3)[[1]])[[1]][1])
# [1] " the thankful classroom is fool "

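Update

Based on your comment, you already have a dictionary, structured as one vector of bad words and another vector of their replacements.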

library(stringr)
a4 <- "I would like a cheseburger and friees please"
badwords.corpus <- c("cheseburger", "friees")
goodwords.corpus <- c("cheeseburger", "fries")

vect.corpus <- goodwords.corpus
names(vect.corpus) <- badwords.corpus

str_replace_all(a4, vect.corpus)
# [1] "I would like a cheeseburger and fries please"

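Update 2

Addressing your comment: your new example brings back the problem of a word appearing inside other words. The solution is to use \\b, which represents a word boundary. The pattern "thin" matches "thin", "think", "thinking", and so on. If you wrap it in \\b, the pattern is anchored to word boundaries: \\bthin\\b matches only "thin".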
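The effect of the word boundary can be checked quickly in base R. This is a small sketch using grepl() and gsub() rather than stringr; the variable names words, plain, bounded, and s are illustrative.

```r
words <- c("thin", "think", "thinking")

# Without boundaries the pattern matches anywhere inside a word.
plain <- grepl("thin", words)
plain
# [1] TRUE TRUE TRUE

# With \\b on both sides only the standalone word matches.
bounded <- grepl("\\bthin\\b", words)
bounded
# [1]  TRUE FALSE FALSE

# The same difference when replacing "thi" with "this".
s <- " thin, thic, thi"
gsub("thi", "this", s)
# [1] " thisn, thisc, this"
gsub("\\bthi\\b", "this", s)
# [1] " thin, thic, this"
```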

Your example:

a <- c(" thin, thic, thi") 
badwords.corpus <- c("thin", "thic", "thi" ) 
goodwords.corpus <- c("think", "thick", "this")

The solution is to modify badwords.corpus:

badwords.corpus <- paste("\\b", badwords.corpus, "\\b", sep = "")
badwords.corpus
# [1] "\\bthin\\b" "\\bthic\\b" "\\bthi\\b"

Then create vect.corpus as I described in the previous update, and use it in str_replace_all.

vect.corpus <- goodwords.corpus
names(vect.corpus) <- badwords.corpus

str_replace_all(a, vect.corpus)
# [1] " think, thick, this" 

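Answer 2: (score: 0)

Create a list of the corrections, then replace them using gsubfn, which is a generalization of gsub that can also take list, function, and proto object replacement objects. The regular expression matches a word boundary, one or more word characters, and another word boundary. Each time it finds a match, it looks the match up in the list names and, if found, replaces it with the corresponding list value.

library(gsubfn)
L <- list(ful = "full") # can add more words to this list if desired
gsubfn("\\b\\w+\\b", L, a1, perl = TRUE)
## [1] " the classroom is full "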
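If gsubfn is not available, the same lookup idea can be sketched in base R with gregexpr() and regmatches(). This is not how gsubfn is implemented, only an illustration of the "match each word, then look it up" logic; dict, words, fixed, and fixed_text are hypothetical names, and the dictionary is a named character vector rather than a list.

```r
# Hypothetical dictionary: names are misspellings, values are corrections.
dict <- c(ful = "full")
a1 <- " the classroom is ful "

# Locate every run of word characters between word boundaries.
m <- gregexpr("\\b\\w+\\b", a1, perl = TRUE)
words <- regmatches(a1, m)

# Swap a word only when it appears among the dictionary names.
fixed <- lapply(words, function(w) {
  unname(ifelse(w %in% names(dict), dict[w], w))
})

# Write the (possibly corrected) words back into their original positions.
fixed_text <- a1
regmatches(fixed_text, m) <- fixed
fixed_text
# [1] " the classroom is full "
```

Because whole words are looked up, a word such as "thankful" is left alone unless it is itself a name in dict.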


Answer 3: (score: 0)

For a kind of ordered replacement, you can try this:

a1 <- c("the classroome is ful")
# ordered replacement
badwords.corpus <- c("ful", "classroome")
goodwords.corpus <- c("full", "classroom")

qdap::mgsub(badwords.corpus, goodwords.corpus, a1) # or
stringi::stri_replace_all_fixed(a1, badwords.corpus, goodwords.corpus, vectorize_all = FALSE)

For unordered replacement, you can use approximate string matching (see stringdist::amatch). Here is an example:

a1 <- c("the classroome is ful")
a1
# [1] "the classroome is ful"

library(stringdist)
goodwords.corpus <- c("full", "classroom")
badwords.corpus <- unlist(strsplit(a1, " ")) # extract words
for (badword in badwords.corpus){
  patt <- paste0('\\b', badword, '\\b')
  repl <- goodwords.corpus[amatch(badword, goodwords.corpus, maxDist = 1)] # you can change the distance see ?amatch
  final.word <- ifelse(is.na(repl), badword, repl)
  a1 <- gsub(patt, final.word, a1)
}
a1
# [1] "the classroom is full"
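To see why maxDist = 1 is enough for this example, the edit distances can be inspected with base R's adist(). This is only a sketch: adist() computes a generalized Levenshtein distance, whereas amatch() defaults to the optimal string alignment method, so the two can differ on other inputs. The variable names words, d, and close are illustrative.

```r
goodwords.corpus <- c("full", "classroom")
words <- c("the", "classroome", "is", "ful")

# Distance from every word in the sentence to every dictionary entry.
d <- adist(words, goodwords.corpus)
dimnames(d) <- list(words, goodwords.corpus)

d["ful", "full"]              # 1: one letter inserted
d["classroome", "classroom"]  # 1: one letter deleted

# Words that lie within distance 1 of some dictionary entry.
close <- rownames(d)[apply(d, 1, min) <= 1]
close
# [1] "classroome" "ful"
```

Raising maxDist catches more severe misspellings but also increases the chance that a correct short word is changed to a dictionary entry it merely resembles.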