How to extract sentences containing citation markers using R

Time: 2018-11-13 20:33:25

Tags: r regex string

For example, I have this string:

string = "The present paper describes an analysis of data from a cohort study of occupational stress in the Royal Navy (Bridger et al., 2010). Data from 2008 Phase III and 2010 Phase V of the survey were analysed to determine whether (cumulative) scores on the General Health Questionnaire (Goldberg and Williams, 1988) and the CFQ (Broadbent et al., 1982), were related to the occurrence of accidents over a three-year period (2007–2010)"

The result should look like this:

"The present paper describes an analysis of data from a cohort study of occupational stress in the Royal Navy (Bridger et al., 2010)."

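Please help me!

2 answers:

Answer 0: (score: 1)

How about using stringi, the powerful underlying library that stringr wraps, and taking full advantage of it rather than relying on crutches and regex hacks: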

stringi::stri_split_boundaries(string, type="sentence")[[1]][1]
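
Building on the call above, here is a minimal sketch of how the sentence split could be combined with a filter so that only sentences carrying a citation are kept. The sample text `txt` and the citation shape (a comma, a space, a four-digit year, and a closing parenthesis) are assumptions made for illustration; real citation styles may need a broader pattern. The splitter tends to leave trailing whitespace on each sentence, so the sketch trims it.

```r
library(stringi)

# Hypothetical sample text: two sentences, only the first has a citation.
txt <- "Stress was measured in the Royal Navy (Bridger et al., 2010). Data were analysed later."

# Split on sentence boundaries, then trim the whitespace that stays
# attached to the end of each sentence.
sentences <- stri_trim_both(stri_split_boundaries(txt, type = "sentence")[[1]])

# Assumed citation shape: ", " + four-digit year + ")".
cited <- sentences[stri_detect_regex(sentences, ", \\d{4}\\)")]
print(cited)
```

Splitting first and filtering afterwards keeps the citation pattern simple, because it no longer has to describe where a sentence starts or ends.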

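Answer 1: (score: 0)

You could start with something like the following: ".*" matches zero or more characters, and ", \\d{4}\\)\\." matches a comma followed by a space, exactly four digits, a closing parenthesis, and a period, for example ", 2010).". If you think the string might contain that sequence somewhere other than in a citation, or that the sentence you want might not be at the start of the string, you may have to be more specific.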

library(stringr)
str_extract(string,".*, \\d{4}\\)\\.")
#[1] "The present paper describes an analysis of data from a cohort study of occupational stress in the Royal Navy (Bridger et al., 2010)."
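
One thing to watch for with this pattern is that ".*" is greedy, so if several sentences end in a citation, the match runs through the last one. The sketch below contrasts it with the lazy form ".*?", which stops at the first citation followed by a period. The sample text `txt` is made up for illustration and is not from the question.

```r
library(stringr)

# Hypothetical text with two sentences that each end in a citation.
txt <- "First claim (Smith et al., 2001). Second claim (Jones, 2005). Closing remark."

# Greedy: runs through the LAST ", yyyy)." in the string.
greedy <- str_extract(txt, ".*, \\d{4}\\)\\.")

# Lazy: stops at the FIRST ", yyyy)." in the string.
lazy <- str_extract(txt, ".*?, \\d{4}\\)\\.")

print(greedy)
print(lazy)
```

For the string in the question both forms return the same sentence, since only one citation there is followed directly by a period.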