Select sentences containing specific words

Time: 2018-10-18 21:41:03

Tags: r quanteda

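In quanteda, is there a way to select a sentence only when two words both occur in it? I have found how to split a text corpus into sentences. Playing with kwic and tokens_select suggests that they apply a logical OR to the two words rather than an AND.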

I can do this with stringr, but I want to make sure I am not missing anything.

An example with strings:

library(tidyverse)

myStr <- c("soil carbon is the best", 
           "biodiversity is key", 
           "soil carbon is biodiversity by nature")

keyw <- c("soil","biodiversity")

# TRUE only when the sentence contains every keyword
tibble(sentences = myStr,
       hit_soil_carbon_biodiversity = purrr::map_lgl(myStr, ~all(str_detect(.x, keyw))))

Thanks for your input!

1 answer:

Answer 0: (score: 2)

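Yes: you can use kwic() to isolate a phrase (a sequence of words), then turn the matches back into a new corpus that contains only the selected sentences. Setting the kwic window = 1000 ensures that even very long sentences are captured in full (up to 2000 + 2 tokens: 1000 on each side of the two-token match).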

library("quanteda")

# reformat the corpus as sentences
sentcorp <- corpus_reshape(data_corpus_inaugural, to = "sentences")
tail(texts(sentcorp))
#                                           2017-Trump.83 
#          "Together, we will make America strong again." 
#                                           2017-Trump.84 
#                   "We will make America wealthy again." 
#                                           2017-Trump.85 
#                     "We will make America proud again." 
#                                           2017-Trump.86 
#                      "We will make America safe again." 
#                                           2017-Trump.87 
# "And, yes, together, we will make America great again." 
#                                           2017-Trump.88 
#      "Thank you, God bless you, and God bless America." 

# illustrate the selection
kwic(sentcorp, phrase("nuclear w*"), window = 3)
# [1977-Carter.47, 18:19]  elimination of all | nuclear weapons | from this Earth
# [1985-Reagan.88, 12:13] further increase of | nuclear weapons | .              
#  [1985-Reagan.90, 9:10]          one day of | nuclear weapons | from the face  
# [1985-Reagan.91, 27:28]          the use of | nuclear weapons | , the other    
#   [1985-Reagan.96, 4:5]     It would render | nuclear weapons | obsolete.  

# now pipe the longer kwic results back into a corpus
newsentcorp <- 
    kwic(sentcorp, phrase("nuclear w*"), window = 1000) %>%
    corpus(split_context = FALSE) %>%
    texts()
newsentcorp[-4]  # because 4 is really long    
#                                                                                                   1977-Carter.47.L18 
# "And we will move this year a step toward ultimate goal - - the elimination of all nuclear weapons from this Earth." 
#                                                                                                   1985-Reagan.88.L12 
#                                        "We are not just discussing limits on a further increase of nuclear weapons." 
#                                                                                                    1985-Reagan.90.L9 
#                               "We seek the total elimination one day of nuclear weapons from the face of the Earth." 
#                                                                                                    1985-Reagan.96.L4 
#                                                                          "It would render nuclear weapons obsolete."
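
The kwic() example above matches a phrase, that is, adjacent words. For the case in the question, where two words may occur anywhere in the same sentence, a minimal base R sketch of the logical AND could look like the following. It reuses myStr and keyw from the question; the names hits_per_word and has_all are introduced here for illustration only, and whole-word, case-insensitive matching is an assumption rather than something the question specifies.

```r
# Sentences and keywords taken from the question
myStr <- c("soil carbon is the best",
           "biodiversity is key",
           "soil carbon is biodiversity by nature")
keyw <- c("soil", "biodiversity")

# One logical vector per keyword: TRUE where the sentence contains
# that keyword as a whole word (assumption: case-insensitive match)
hits_per_word <- lapply(keyw, function(w) {
  grepl(paste0("\\b", w, "\\b"), myStr, ignore.case = TRUE, perl = TRUE)
})

# Logical AND across keywords: TRUE only if every keyword is present
has_all <- Reduce(`&`, hits_per_word)

# Keep only the sentences that contain all keywords
myStr[has_all]
```

The same idea should carry over to the character vector returned by texts(sentcorp), since grepl() works on any character vector. Keywords containing regex metacharacters would need escaping first.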