Regex to match sentences containing decimals and names

Time: 2014-09-21 11:59:50

Tags: c# regex

I feel I'm very close with this, but as soon as I move the punctuation capture to the end of the sentence, it starts missing captures.

The sample input is:

This is a sentence with a name like Mr. D. Smith and Mr J. Smith in it. This is a  sentence      with odd   spacing. This is one with lots of exclamation marks at the end!!!!This is another with a decimal 10.00 in the middle. Why is it so hard to find sentence endings?Last sentence without a space at the start.

This should result in the following captures:

This is a sentence with a name like Mr. D. Smith and Mr J. Smith in it. 
This is a  sentence      with odd   spacing. 
This is one with lots of exclamation marks at the end!!!!
This is another with a decimal 10.00 in the middle. 
Why is it so hard to find sentence endings?
Last sentence without a space at the start.

This is my expression:

.*?(?:[!?.;]+)((?<!(Mr|Mrs|Dr|Rev).?)(?=\D|\s+|$)(?:[^!?.;\d]|\d*\.?\d+)*)(?=(?:[!?.;]+))

There are currently two problems, plus a note:

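  1. The punctuation ends up at the start of each capture.
  2. It correctly handles one name per sentence but not two (bonus points: I'd like it to capture "Mr DJ Smith" correctly, but I can't figure out how to do that without it failing to match sentences that end in a single letter).
  3. The data going into this will be somewhat normalized, so we know it will end with a full stop and be on a single line, but any pointers are welcome.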

1 Answer:

Answer 0: (score: 0)

I agree with @spender's suggestion to use a parser to handle all of the punctuation rules.

However, the following works for your scenario.

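// Group 1 is the sentence with its end punctuation; the trailing \s* consumes the gap before the next one.
foreach (Match m in Regex.Matches(s, @"(.*?(?<!(?:\b[A-Z]|Mrs?|Dr|Rev|\d))[!?.;]+)\s*"))
    Console.WriteLine(m.Groups[1].Value);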

Output

This is a sentence with a name like Mr. D. Smith and Mr J. Smith in it. 
This is a  sentence      with odd   spacing. 
This is one with lots of exclamation marks at the end!!!!
This is another with a decimal 10.00 in the middle. 
Why is it so hard to find sentence endings?
Last sentence without a space at the start.
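
To make it easier to see why this pattern works, here is the same expression laid out with RegexOptions.IgnorePatternWhitespace so each part can carry a comment. This is only a sketch: the class name SentenceSplitter and the method name Split are made up for illustration, and the pattern itself is unchanged from the one above.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class SentenceSplitter   // hypothetical name, for illustration only
{
    // Same pattern as in the answer, spread over several lines.
    private static readonly Regex SentencePattern = new Regex(@"
        (                 # group 1: one sentence plus its end punctuation
          .*?             # as few characters as possible
          (?<!            # the punctuation must not be directly preceded by:
            (?: \b[A-Z]   #   a lone capital letter (an initial such as D. or J.)
              | Mrs?      #   Mr or Mrs
              | Dr
              | Rev
              | \d        #   a digit, so 10.00 is not split
            )
          )
          [!?.;]+         # one or more sentence-ending marks
        )
        \s*               # consume the whitespace between sentences
        ", RegexOptions.IgnorePatternWhitespace);

    // Returns the contents of group 1 for every match, in order.
    public static List<string> Split(string text)
    {
        var sentences = new List<string>();
        foreach (Match m in SentencePattern.Matches(text))
            sentences.Add(m.Groups[1].Value);
        return sentences;
    }

    public static void Main()
    {
        string s = "Mr. D. Smith met Dr. Who. It cost 10.50 today!Done?";
        foreach (string sentence in Split(s))
            Console.WriteLine(sentence);
    }
}
```

The negative lookbehind is what keeps "Mr.", "D." and "10.00" from ending a sentence. The trade-off is that a sentence which really does end in a single capital letter or a digit (for example "It cost 10.") is not split at that point and gets merged with the sentence that follows, which is the limitation hinted at in point 2 of the question.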
