Excluding sentences by keyword in R

Time: 2017-10-11 18:57:05

Tags: r string grepl negation

I am working with sentences like the following:

    Has no anorexia
    She denies anorexia
    Has anorexia
    Positive for Anorexia

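My goal is to exclude sentences that contain words such as denies, denied, or no, and to keep only the positive indications of anorexia.

The final result should be: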

     Has anorexia
     Positive for Anorexia

I tried this with the grepl function:

     negation <- c("no","denies","denied")
     if (grepl(paste(negation,collapse="|"), Anorexia_sentences[j]) == TRUE){

     Anorexia_sentences[j] <- NA

     }

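This does not work. I think the "no" inside the word "Anorexia" (A-no-rexia) is causing the problem. Any suggestions on how to fix this would be greatly appreciated.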

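2 answers:

Answer 0: (score: 4)

The corpus library has functions analogous to their stringr equivalents, but they work at the term level rather than the character level. This works: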

    library(corpus)
    negation <- c("no", "denies", "denied")
    text <- c("Has no anorexia", "She denies anorexia", "Has anorexia",
              "Positive for Anorexia", "Denies anorexia")
    text[!text_detect(text, negation)]
    ## [1] "Has anorexia"          "Positive for Anorexia"

If you want a solution that uses only base R, use the following code instead:

    pattern <- paste0("\\b(", paste(negation, collapse = "|"), ")\\b")
    text[!grepl(pattern, text, ignore.case = TRUE)]
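
To see why the attempt in the question flags every sentence, it may help to compare the unanchored pattern with the word-boundary one. The following is a minimal sketch, assuming the sentences live in a character vector named `Anorexia_sentences` as in the question. The names `loose`, `strict`, and `cleaned` are hypothetical and introduced here only for illustration; the `NA` assignment mirrors what the question's loop was trying to do.

```r
# Sentences from the question, plus one capitalised negation.
Anorexia_sentences <- c("Has no anorexia", "She denies anorexia",
                        "Has anorexia", "Positive for Anorexia",
                        "Denies anorexia")
negation <- c("no", "denies", "denied")

# Unanchored alternation: "no" also matches inside "anorexia",
# so every sentence is flagged.
loose <- paste(negation, collapse = "|")
loose_hits <- grepl(loose, Anorexia_sentences)

# Word-boundary version: each keyword must stand alone as a word.
strict <- paste0("\\b(", loose, ")\\b")
strict_hits <- grepl(strict, Anorexia_sentences, ignore.case = TRUE)

# Vectorised equivalent of the loop in the question:
# set negated sentences to NA (cleaned is a hypothetical name).
cleaned <- Anorexia_sentences
cleaned[strict_hits] <- NA
print(cleaned)
```

Because `grepl()` is vectorised, no loop over `j` is needed. If you only want the surviving sentences, `cleaned[!is.na(cleaned)]` drops the `NA` entries.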

Answer 1: (score: 0)

You can also do this easily using the quanteda package. To get the character object to register as sentences, you would need either punctuation, or to segment the lines into elements of a character vector. Then, we can use char_trimsentences() to remove those with a particular pattern match when tokenized.

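    library("quanteda")

    # Assumed input (not shown in the original answer):
    # the four sentences from the question, one per line
    txt <- "Has no anorexia\nShe denies anorexia\nHas anorexia\nPositive for Anorexia"

    readLines(textConnection(txt)) %>%
        char_trimsentences(exclude_pattern = c("\\bden\\w+\\b|\\bno\\b"))
    ##              text3                   text4 
    ##     "Has anorexia" "Positive for Anorexia" 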

The regular expression guarantees that you will match words with the glob pattern "den*", and "no" as a whole word only (and not as part of "anorexia").
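
If you want to check what that regular expression matches on its own, here is a small sketch using base R's `grepl()`. It only tests the pattern itself and says nothing about how quanteda applies it internally. The vector `sentences` and the extra "Denies anorexia" entry are assumptions added here to probe case handling.

```r
# The pattern from the answer above, checked with base R's grepl().
pattern <- "\\bden\\w+\\b|\\bno\\b"

# Question sentences plus a capitalised negation (added for illustration).
sentences <- c("Has no anorexia", "She denies anorexia",
               "Has anorexia", "Positive for Anorexia",
               "Denies anorexia")

# Case-sensitive by default: "Denies" is not caught.
case_sensitive <- grepl(pattern, sentences)

# With ignore.case = TRUE the capitalised form is caught as well.
case_insensitive <- grepl(pattern, sentences, ignore.case = TRUE)

print(sentences[!case_insensitive])

# "den*" is broad: it also matches unrelated words such as "dentist".
broad_match <- grepl(pattern, "Saw the dentist")
```

In this sketch "anorexia" is never matched by the `\\bno\\b` branch, but the `den*` branch will also flag words like "dentist", so an explicit keyword list may be safer on real clinical notes.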