R gsub: extract n words before and after a term

Time: 2018-03-30 16:09:05

Tags: r gsub

I need to extract the n words that appear before and after a term, for a text analysis I am working on. Here is a reproducible example:

a <- c("The day was nice and dry, when she came for our game we were ready and then she left.",
"The day was nice and dry, when she came for our game, but we were not ready. She left after she waited 5 minutes.",
"The day was nice and dry, when she came, we were not here. Our game  was not completed timely, but it was completed after one hour.")

Below is the call I am using, but it fails when there is punctuation around the words or a double space between them.

gsub(".*(( \\w{1,}){3} game( \\w{1,}){3}).*", "\\1", a, perl = TRUE)


[1] " came for our game we were ready"                                                                                                  
[2] "The day was nice and dry, when she came for our game, but we were not ready. She left after she waited 5 minutes."                 
[3] "The day was nice and dry, when she came, we were not here. Our game  was not completed timely, but it was completed after one hour."
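
The second and third elements come back whole because `gsub()` returns its input unchanged when the pattern does not match. A minimal sketch that makes this visible, assuming the same vector `a` as above (the names `pattern`, `matched` and `unchanged` are introduced here only for illustration):

```r
a <- c("The day was nice and dry, when she came for our game we were ready and then she left.",
"The day was nice and dry, when she came for our game, but we were not ready. She left after she waited 5 minutes.",
"The day was nice and dry, when she came, we were not here. Our game  was not completed timely, but it was completed after one hour.")

# The original pattern: every word must be preceded by exactly one space
pattern <- ".*(( \\w{1,}){3} game( \\w{1,}){3}).*"

# Which elements does the pattern match at all?
matched <- grepl(pattern, a, perl = TRUE)
matched
# [1]  TRUE FALSE FALSE

# Elements without a match are returned as they were passed in
unchanged <- gsub(pattern, "\\1", a, perl = TRUE) == a
unchanged
# [1] FALSE  TRUE  TRUE
```

In the second string the comma after "game" breaks the match, and in the third both the period before "Our" and the double space after "game" do.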

Here is the desired output:

[1] " came for our game we were ready"                                                                                                  
[2] " came for our game, but we were"                 
[3] " not here. Our game was not completed"

2 answers:

Answer 0: (score: 2)

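Try \\W{1,} instead of a literal space. It matches one or more non-word characters, so punctuation and repeated spaces between words are both accepted:
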
gsub(".*(((\\W{1,})\\w{1,}){3} game((\\W{1,})\\w{1,}){3}).*", "\\1", a, perl = TRUE)

[1] " came for our game we were ready"
[2] " came for our game, but we were"
[3] " not here. Our game  was not completed"

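The same idea can be wrapped so that the term and the number of words are parameters. This is only a sketch: `extract_context` is a hypothetical helper name, it assumes `term` contains no regex metacharacters, and it uses `regexpr()` with `regmatches()` instead of `gsub()` so that strings without a match give `NA` rather than the whole input.

```r
a <- c("The day was nice and dry, when she came for our game we were ready and then she left.",
"The day was nice and dry, when she came for our game, but we were not ready. She left after she waited 5 minutes.",
"The day was nice and dry, when she came, we were not here. Our game  was not completed timely, but it was completed after one hour.")

# Hypothetical helper: n words before and after `term`.
# Assumes `term` has no regex metacharacters.
extract_context <- function(x, term, n = 3) {
  # \W+ stands for any run of spaces and/or punctuation between words
  pattern <- sprintf("(\\W+\\w+){%d}\\W+%s(\\W+\\w+){%d}", n, term, n)
  m <- regexpr(pattern, x, perl = TRUE)
  hit <- !is.na(m) & m != -1
  out <- rep(NA_character_, length(x))
  out[hit] <- regmatches(x, m)   # regmatches() returns matched elements only
  out
}

res3 <- extract_context(a, "game", n = 3)
res2 <- extract_context(a[1], "game", n = 2)
none <- extract_context("no such term here", "game")
```

With `n = 3` this should reproduce the three strings shown above, and an input with no match yields `NA`.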
Answer 1: (score: 0)

Here is another approach, using str_extract from the stringr package:

library(stringr)

str_extract(a, "(( \\S+){3} game[[:punct:]\\s]*( \\S+){3})")

# [1] " came for our game we were ready"       
#     " came for our game, but we were"        
#     " not here. Our game  was not completed"
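
Both answers keep the double space in the third string, whereas the desired output in the question shows a single space. One possible way to close that gap, sketched here in base R as an assumption rather than as part of either answer, is to collapse runs of whitespace before extracting (`a_clean` and `res` are illustrative names):

```r
a <- c("The day was nice and dry, when she came for our game we were ready and then she left.",
"The day was nice and dry, when she came for our game, but we were not ready. She left after she waited 5 minutes.",
"The day was nice and dry, when she came, we were not here. Our game  was not completed timely, but it was completed after one hour.")

# Collapse every run of whitespace into a single space first
a_clean <- gsub("\\s+", " ", a, perl = TRUE)

# Then extract three words on each side of " game",
# allowing punctuation between words via \W+
res <- gsub(".*((\\W+\\w+){3} game(\\W+\\w+){3}).*", "\\1", a_clean, perl = TRUE)
res[3]
# [1] " not here. Our game was not completed"
```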