Split a character vector into sentences

Time: 2017-10-23 08:05:51

Tags: r regex

I have the following character vector:

"This is a very long character vector. Why is it so long? I want to split this vector into senteces by using e.g. strssplit. Can someone help me? That would be nice?"

I would like to split it into sentences using the following pattern (i.e. period, space, uppercase letter):

"This is a very long character vector."
"Why is it so long? I want to split this vector into senteces by using e.g. strssplit."
"Can someone help me?"
"That would be nice?"

So a period after an abbreviation should not start a new sentence. I would like to do this with a regular expression in R.

Can someone help me?

2 answers:

Answer 0: (score: 3)

A solution using strsplit:

string <- "This is a very long character vector. Why is it so long? I think lng. is short for long. I want to split this vector into senteces by using e.g. strssplit. Can someone help me? That would be nice?"
unlist(strsplit(string, "(?<=[[:punct:]])\\s(?=[A-Z])", perl = TRUE))

Result:

[1] "This is a very long character vector."                             
[2] "Why is it so long?"                                                
[3] "I think lng. is short for long."                                   
[4] "I want to split this vector into senteces by using e.g. strssplit."
[5] "Can someone help me?"                                              
[6] "That would be nice?" 

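This matches a whitespace character that is preceded by a punctuation character and followed by an uppercase letter. The lookbehind (?<=[[:punct:]]) and the lookahead (?=[A-Z]) are zero-width, so only the whitespace is consumed as the delimiter: the punctuation stays at the end of the piece before the split, and the uppercase letter stays at the start of the piece after it.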
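To see why the lookarounds matter, here is a minimal sketch that compares the same pattern with and without them. The sample string x and the variable names are made up for illustration and are not part of the original answer.

```r
# Made-up sample input, for illustration only
x <- "First one. Second one? Third one."

# Without lookarounds, the punctuation, the space and the capital letter
# are all part of the delimiter, so strsplit removes them from the pieces.
consumed <- strsplit(x, "[[:punct:]]\\s[A-Z]", perl = TRUE)[[1]]

# With lookarounds, only the whitespace is the delimiter; the punctuation
# and the capital letter stay in the neighbouring pieces.
kept <- strsplit(x, "(?<=[[:punct:]])\\s(?=[A-Z])", perl = TRUE)[[1]]

print(consumed)
print(kept)
```

The first call loses the closing punctuation and the leading capital of each sentence, while the second keeps every sentence intact.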

Edit: I just saw that you do not split after a question mark in your desired output. If you only want to split after a "." you can use this:

unlist(strsplit(string, "(?<=\\.)\\s(?=[A-Z])", perl = TRUE))

which gives

[1] "This is a very long character vector."                             
[2] "Why is it so long? I think lng. is short for long."                
[3] "I want to split this vector into senteces by using e.g. strssplit."
[4] "Can someone help me? That would be nice?"  
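The two variants differ only in which characters may end a sentence, so they could be folded into one small helper. The following is a sketch under that assumption: split_sentences and its terminators argument are hypothetical names, not part of the answer, and the sample string is shortened for illustration.

```r
# Hypothetical helper (not from the answer): split on whitespace that follows
# one of the given terminator characters and precedes an uppercase letter.
# Assumes `terminators` holds characters that are safe inside a regex
# character class (e.g. ".", "?", "!"); "]", "^" or "\\" would need escaping.
split_sentences <- function(x, terminators = ".") {
  pattern <- sprintf("(?<=[%s])\\s(?=[A-Z])", terminators)
  unlist(strsplit(x, pattern, perl = TRUE))
}

# Shortened sample input, for illustration only
string <- "This is a very long character vector. Why is it so long? Can someone help me by using e.g. strsplit? That would be nice."

periods_only <- split_sentences(string)
periods_and_questions <- split_sentences(string, terminators = ".?")

print(periods_only)
print(periods_and_questions)
```

In both calls the abbreviation "e.g." does not trigger a split, because it is followed by a lowercase letter. An abbreviation followed by a capitalised word would still be split, which is a limitation of this pattern.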

Answer 1: (score: 1)

You can use the package tokenizers:

library(tokenizers)
tokenize_sentences(x)

where x is your character vector. It results in

[[1]]
[1] "This is a very long character vector."

[[2]]
[1] "Why is it so long?"                                                
[2] "I want to split this vector into senteces by using e.g. strssplit."

[[3]]
[1] "Can someone help me?"

[[4]]
[1] "That would be nice?"   

You can then use unlist to remove the list structure.