Question

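I want to extract the first sentence from the string below using a regular expression. The rule I would like to implement (I know it is not a general solution) is to extract from the start of the string (^) up to and including the first period/exclamation mark/question mark that is preceded by a lowercase letter or a digit.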

require(stringr)

x = "Bali bombings: U.S. President George W. Bush amongst many others has condemned the perpetrators of the Bali car bombing of October 11. The death toll has now risen to at least 187."

My best guess so far was to try a non-greedy string-before-match approach, which fails in this case:

str_extract(x, '.+?(?=[a-z0-9][.?!] )')
[1] NA

Any hints would be much appreciated.

Answer 1

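You put [a-z0-9][.?!] inside a non-consuming lookahead. If you plan to use str_extract, that part of the pattern has to be consumed so that it ends up in the match: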

> str_extract(x, '.*?[a-z0-9][.?!](?= )')
[1] "Bali bombings: U.S. President George W. Bush amongst many others has condemned the perpetrators of the Bali car bombing of October 11."

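See this regex demo.

Details

.*? - any 0+ characters other than line break characters, as few as possible
[a-z0-9] - an ASCII lowercase letter or a digit
[.?!] - a ., ? or !
(?= ) - followed by a literal space (checked, but not added to the match).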
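For readers without stringr, the same consuming pattern can be sketched in base R with regexpr and regmatches. This is only an illustration under some assumptions: the helper name extract_first_sentence is hypothetical, the pattern is anchored with ^, and the lookahead is widened to (?=\\s|$) so that a sentence ending the string also matches, which the pattern in the answer does not do.

```r
x <- "Bali bombings: U.S. President George W. Bush amongst many others has condemned the perpetrators of the Bali car bombing of October 11. The death toll has now risen to at least 187."

# Hypothetical helper, not part of the original answer.
# Returns the first sentence of each element of `s`, or NA when the rule
# (lowercase letter or digit, then . ? or !) never applies.
extract_first_sentence <- function(s) {
  # perl = TRUE is needed for the lazy quantifier and the lookahead.
  m <- regexpr("^.*?[a-z0-9][.?!](?=\\s|$)", s, perl = TRUE)
  out <- rep(NA_character_, length(s))
  matched <- as.integer(m) != -1L
  # regmatches() returns only the elements that matched.
  out[matched] <- regmatches(s, m)
  out
}

first_sentence <- extract_first_sentence(x)
print(first_sentence)
```

Because "U.S." and "W." end in uppercase letters, the lazy match skips past them and stops at "11.", the first position where a digit is followed by a period and whitespace.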

Alternatively, you can use sub:

sub("([a-z0-9][?!.])\\s.*", "\\1", x)

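See this regex demo.

Details

([a-z0-9][?!.]) - Group 1 (referred to as \1 in the replacement pattern): an ASCII lowercase letter or a digit, then a ?, ! or .
\s - a whitespace character
.* - any 0+ characters, as many as possible (up to the end of the string).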
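One difference between the two approaches is worth a small sketch: when the pattern does not match, sub returns its input unchanged instead of NA. The example below reuses the pattern from the answer; the second input string and the variable names are made up for illustration.

```r
x <- "Bali bombings: U.S. President George W. Bush amongst many others has condemned the perpetrators of the Bali car bombing of October 11. The death toll has now risen to at least 187."
pattern <- "([a-z0-9][?!.])\\s.*"

# Keep everything up to and including group 1; drop the whitespace and the rest.
first_sentence <- sub(pattern, "\\1", x)
print(first_sentence)

# Illustrative input with no sentence-ending punctuation: sub() finds no
# match and hands the string back untouched.
no_match <- "no terminal punctuation here"
unchanged <- sub(pattern, "\\1", no_match)

# If NA is preferred for such inputs, guard the call with grepl().
guarded <- if (grepl(pattern, no_match)) sub(pattern, "\\1", no_match) else NA_character_
```

Whether an unchanged string or NA is the better result for unmatched input depends on how the output is used downstream.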

Answer 2

The corpus package has special handling for abbreviations when determining sentence boundaries:

library(corpus)       
text_split(x, "sentences")
#>   parent index text
#> 1 1          1 Bali bombings: U.S. President George W. Bush amongst many others has condemned the perpetrators of the Bali car bombing of Oct…
#> 2 1          2 The death toll has now risen to at least 187.

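The package also ships some useful datasets with common abbreviations for many languages, including English. See corpus::abbreviations_en, which can be used to disambiguate sentence boundaries.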

Extracting the first sentence from a string
