Remove a whole list of words that may contain special characters from a character vector, without matching parts of words

Time: 2018-07-04 12:48:46

Tags: r regex gsub stringr

I have a list of words in R, as shown below:

 myList <- c("at","ax","CL","OZ","Gm","Kg","C100","-1.00")

I want to remove the words in the list above from a text like this one:

 myText <- "This is at Sample ax Text, which CL is OZ better and cleaned Gm, where C100 is not equal to -1.00. This is messy text Kg."

After the unwanted myList words have been removed, myText should look like this:

  This is Sample Text, which is better and cleaned, where is not equal to. This is messy text.

I am using:

  stringr::str_replace_all(myText,"[^a-zA-Z\\s]", " ")

But this does not help. What should I do?
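
To see why the attempt above does not help, here is a minimal sketch of what that pattern does. It assumes base R's gsub with perl = TRUE as a stand-in for the stringr call (the variable names attempt, still_there and digits_left are only for illustration). The character class replaces every character that is neither a letter nor whitespace, so it never looks at whole words.

```r
myText <- "This is at Sample ax Text, which CL is OZ better and cleaned Gm, where C100 is not equal to -1.00. This is messy text Kg."

# Assumed base R stand-in for stringr::str_replace_all(myText, "[^a-zA-Z\\s]", " ");
# perl = TRUE so that \s is understood inside the character class.
attempt <- gsub("[^a-zA-Z\\s]", " ", myText, perl = TRUE)

# The listed word "at" is still in the text ...
still_there <- grepl(" at ", attempt, fixed = TRUE)
# ... while digits and punctuation were blanked out one character at a time.
digits_left <- grepl("[0-9]", attempt)

print(attempt)
print(c(still_there = still_there, digits_left = digits_left))
```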

2 answers:

Answer 0: (score: 1)

gsub(paste0(myList, collapse = "|"), "", myText)

which gives:

[1] "This is  Sample  Text, which  is  better and cleaned , where  is not equal to . This is messy text ."
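
A caveat worth checking before relying on the plain alternation above: it has no word-boundary checks and does not escape regex metacharacters. The sketch below uses a shortened, hypothetical word list and made-up input strings to show both effects.

```r
# Shortened, hypothetical word list for illustration
words <- c("at", "ax", "-1.00")
pat <- paste0(words, collapse = "|")

# No boundary check: "at" and "ax" are removed from inside longer words.
partial <- gsub(pat, "", "cat and tax")

# Unescaped ".": it matches any character, so "-1x00" is removed too.
dot_wild <- gsub(pat, "", "value -1x00 here")

print(partial)
print(dot_wild)
```

It works on the sample text only because none of the other words happens to contain a listed word as a substring.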

Answer 1: (score: 1)

You can use a PCRE regex with the base R gsub function (it will also work with the ICU regex flavor in str_replace_all):

\s*(?<!\w)(?:at|ax|CL|OZ|Gm|Kg|C100|-1\.00)(?!\w)

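Details

  • \s* - zero or more whitespace characters
  • (?<!\w) - a negative lookbehind that ensures there is no word char immediately before the current position
  • (?:at|ax|CL|OZ|Gm|Kg|C100|-1\.00) - a non-capturing group containing the escaped items of the character vector, that is, the words to remove
  • (?!\w) - a negative lookahead that ensures there is no word char immediately after the current position.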
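
The effect of the two lookarounds can be seen on a small, made-up string. This is only a sketch with a single word, "at", in place of the full alternation:

```r
x <- "cat at attic, at flat at"

# "at" is removed only where no word char touches it on either side;
# the "at" inside "cat", "attic" and "flat" is left alone.
result <- gsub("\\s*(?<!\\w)at(?!\\w)", "", x, perl = TRUE)

print(result)
# [1] "cat attic, flat"
```

The leading \s* also consumes the whitespace in front of each removed word, which is why no double spaces are left behind.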

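Note: we cannot use the \b word boundary here, because the items in the myList character vector may start or end with non-word characters, and the meaning of \b is context-dependent.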
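
As a small illustration of that note (the input string is made up for this sketch), compare \b with the lookarounds for the item -1.00, which starts with a non-word character:

```r
x <- "not equal to -1.00."

# \b requires a word char on exactly one side. Between " " and "-" both
# sides are non-word chars, so there is no boundary and nothing matches.
with_b <- gsub("\\s*\\b-1\\.00\\b", "", x, perl = TRUE)

# The lookarounds only require that no word char is adjacent.
with_look <- gsub("\\s*(?<!\\w)-1\\.00(?!\\w)", "", x, perl = TRUE)

print(with_b)
# [1] "not equal to -1.00."
print(with_look)
# [1] "not equal to."
```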

A complete R example:

myList <- c("at","ax","CL","OZ","Gm","Kg","C100","-1.00")
myText <- "This is at Sample ax Text, which CL is OZ better and cleaned Gm, where C100 is not equal to -1.00. This is messy text Kg."
# Prefix every character that is special in a PCRE pattern with a backslash
escape_for_pcre <- function(s) { return(gsub("([{[()|?$^*+.\\])", "\\\\\\1", s)) }
# Optional leading whitespace, then any listed word with no word char on either side
pat <- paste0("\\s*(?<!\\w)(?:", paste(sapply(myList, function(t) escape_for_pcre(t)), collapse = "|"), ")(?!\\w)")
cat(pat, "\n")  # print the generated pattern
gsub(pat, "", myText, perl=TRUE)
## => [1] "This is Sample Text, which is better and cleaned, where is not equal to. This is messy text."

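Details

  • escape_for_pcre <- function(s) { return(gsub("([{[()|?$^*+.\\])", "\\\\\\1", s)) } - escapes the characters that are special in a PCRE pattern by prefixing each of them with a backslash
  • paste(sapply(myList, function(t) escape_for_pcre(t)), collapse = "|") - creates a |-separated list of alternatives from the vector of search terms.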
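
The steps above can be wrapped into a reusable function. The sketch below is not part of the answer: remove_words is a hypothetical helper name, and sorting the items longest first is an added assumption, so that a shorter item that is a prefix of a longer one (for example "C" and "C-1") does not win the alternation.

```r
# Hypothetical helper, assuming the same escaping and pattern as above
remove_words <- function(text, words) {
  # Escape the characters that are special in a PCRE pattern
  escaped <- gsub("([{[()|?$^*+.\\])", "\\\\\\1", words)
  # Assumption: try longer alternatives first
  escaped <- escaped[order(nchar(escaped), decreasing = TRUE)]
  pat <- paste0("\\s*(?<!\\w)(?:", paste(escaped, collapse = "|"), ")(?!\\w)")
  gsub(pat, "", text, perl = TRUE)
}

myList <- c("at","ax","CL","OZ","Gm","Kg","C100","-1.00")
myText <- "This is at Sample ax Text, which CL is OZ better and cleaned Gm, where C100 is not equal to -1.00. This is messy text Kg."
cleaned <- remove_words(myText, myList)
print(cleaned)

# gsub is vectorized, so a character vector of texts works as well
several <- remove_words(c("I am at home", "use C100 not C1000"), c("at", "C100"))
print(several)
```

Note that a listed word at the very start of a text has no whitespace in front of it, so the space that follows it is kept; trimws() could be applied afterwards if that matters.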