Question

I'm trying to understand non-capturing groups in regular expressions.

Suppose I have the following input:

He hit the ball.  Then he ran.  The crowd was cheering!  How did he feel?  I felt so energized!

To extract the first word of each sentence, I tried this match pattern:

^(\w+\b.*?)|[\.!\?]\s+(\w+)

which puts the desired output in the submatch:

Match   $1
He      He  
. Then  Then
. The   The
! How   How
? I     I

But I thought that by using non-capturing groups I should be able to get the words back in the match itself.

I tried:

^(?:\w+\b.*?)|(?:[\.!\?]\s+)(\w+)

which yielded:

Match   $1
He  
. Then  Then
. The   The
! How   How
? I     I

and

^(?:\w+\b.*?)|(?:[\.!\?]\s+)\w+

which yielded:

Match
He
. Then
. The
! How
? I

What am I missing?

(I'm using RegExLib.com to test my regular expressions, but will transfer them to VBA afterwards.)

Answer 1

A simple example against the string "foo":

(f)(o+)

yields $1 = 'f' and $2 = 'oo';

(?:f)(o+)

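Here, $1 = 'oo', because you have explicitly said not to capture the first group, and there is no second capturing group. The non-capturing group still has to match and is still part of the overall match; it just isn't numbered.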
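A minimal sketch of the same comparison, using Python's `re` module purely for illustration (an assumption on my part; the question targets VBA, but these two patterns use nothing beyond plain capturing and non-capturing groups):

```python
import re

text = "foo"

# Two capturing groups: group 1 is "f", group 2 is "oo".
both = re.match(r"(f)(o+)", text)

# (?:f) still has to match "f", but it is not numbered,
# so the only capturing group, "oo", becomes group 1.
only_o = re.match(r"(?:f)(o+)", text)

print(both.groups())    # ('f', 'oo')
print(only_o.groups())  # ('oo',)
print(only_o.group(0))  # foo
```

The last line is the part that is easy to overlook: a non-capturing group changes which groups are numbered, not what the overall match contains.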

For your scenario, this feels about right:

(?:(\w+).*?[\.\?!] {2}?)

Note that the outermost group is non-capturing, while the inner group (the first word of the sentence) is capturing.
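To see how this pattern behaves on the sample input, here is a rough sketch in Python's `re` module (used only for illustration, not the asker's VBA). The `relaxed` pattern is my own hypothetical variation, not something proposed in this answer:

```python
import re

text = ("He hit the ball.  Then he ran.  The crowd was cheering!  "
        "How did he feel?  I felt so energized!")

# Pattern copied from the answer: first word, lazily skip to the
# sentence-ending punctuation, then exactly two spaces.
pattern = re.compile(r"(?:(\w+).*?[\.\?!] {2}?)")
first_words = [m.group(1) for m in pattern.finditer(text)]
print(first_words)  # ['He', 'Then', 'The', 'How']

# Hypothetical variation: accept any whitespace or the end of the
# string after the punctuation, so the last sentence is included.
relaxed = re.compile(r"(\w+).*?[.?!](?:\s+|$)")
relaxed_words = [m.group(1) for m in relaxed.finditer(text)]
print(relaxed_words)  # ['He', 'Then', 'The', 'How', 'I']
```

Because the original pattern insists on two spaces after the punctuation, the final sentence, which ends the string, is not matched.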

Answer 2

The following uses a non-capturing group for the boundary condition and a capturing group to grab the word that follows it.

(?:^|[.?!]\s*)(\w+)

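You haven't said how you are applying the regular expression to your text, but your usual "pull out another match until there are no more" loop should work.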
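A sketch of that loop, again in Python's `re` module as a stand-in for whatever engine is actually used (an assumption; in VBA the equivalent would be iterating over the result of `Execute` with `Global = True`):

```python
import re

text = ("He hit the ball.  Then he ran.  The crowd was cheering!  "
        "How did he feel?  I felt so energized!")

# The boundary (start of string, or sentence punctuation plus optional
# whitespace) is matched but not captured; only the word is captured.
pattern = re.compile(r"(?:^|[.?!]\s*)(\w+)")

# Each match object carries both the overall match and the capture.
pairs = [(m.group(0), m.group(1)) for m in pattern.finditer(text)]
for whole, word in pairs:
    print(repr(whole), "->", word)

first_words = [word for _, word in pairs]
print(first_words)  # ['He', 'Then', 'The', 'How', 'I']
```

The overall match still contains the punctuation and spaces (for example `'.  Then'`); the bare word is only available through the capturing group. That is why the asker's attempts kept showing the punctuation in the Match column.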

Answer 3

This works and is simple:

([A-Z])\w*

VBA needs the following flag settings:

Global = True 'Match all occurrences not just first
IgnoreCase = False 'First word of each sentence starts with a capital letter

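Here is some more hard-won information: because your regular expression contains at least one set of parentheses, you can use SubMatches to pull out only the values inside the parentheses and ignore the rest, which is very useful. Here is the debug output of a function I use to get submatches, run on your string: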

theMatches.Count=5
Match='He'
   Submatch Count=1
   Submatch='H'
Match='Then'
   Submatch Count=1
   Submatch='T'
Match='The'
   Submatch Count=1
   Submatch='T'
Match='How'
   Submatch Count=1
   Submatch='H'
Match='I'
   Submatch Count=1
   Submatch='I'

T

Here is the call to my function that returned the output above:

sText = "He hit the ball.  Then he ran.  The crowd was cheering!  How did he feel?  I felt so energized!"
sRegEx = "([A-Z])\w*"
Debug.Print ExecuteRegexCapture(sText, sRegEx, 2, 0) '3rd match, 1st Submatch

Here is the function:

'Returns Submatch specified by the passed zero-based indices:
'iMatch is which match you want,
'iSubmatch is the index within the match of the parenthesis
'containing the desired results.
Function ExecuteRegexCapture(sStringToSearch, sRegEx, iMatch, iSubmatch)
   Dim oRegex As Object, theMatches As Object
   Dim bDebug As Boolean, i As Long, j As Long
   'Late binding: works without a reference to "Microsoft VBScript Regular Expressions 5.5"
   Set oRegex = CreateObject("VBScript.RegExp")
   oRegex.Pattern = sRegEx
   oRegex.Global = True 'True = find all matches, not just first
   oRegex.IgnoreCase = False
   oRegex.Multiline = True 'True = ^ and $ also match at line breaks, not only at the ends of the whole string
   bDebug = True

   ExecuteRegexCapture = ""

   Set theMatches = oRegex.Execute(sStringToSearch)
   If bDebug Then Debug.Print "theMatches.Count=" & theMatches.Count

   For i = 0 To theMatches.Count - 1
      If bDebug Then Debug.Print "Match='" & theMatches(i) & "'"
      If bDebug Then Debug.Print "   Submatch Count=" & theMatches(i).SubMatches.Count
      For j = 0 To theMatches(i).SubMatches.Count - 1
         If bDebug Then Debug.Print "   Submatch='" & theMatches(i).SubMatches(j) & "'"
      Next j
   Next i

   If bDebug Then Debug.Print ""

   If iMatch < theMatches.Count Then
      If iSubmatch < theMatches(iMatch).SubMatches.Count Then
         ExecuteRegexCapture = theMatches(iMatch).SubMatches(iSubmatch)
      End If
   End If
End Function
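For comparison, a rough Python counterpart of the lookup part of this function (the name `regex_capture` is hypothetical, and Python's `re` stands in for the VBScript engine). It also shows that moving the closing parenthesis makes the submatch the whole word rather than just its first letter:

```python
import re

text = ("He hit the ball.  Then he ran.  The crowd was cheering!  "
        "How did he feel?  I felt so energized!")

def regex_capture(text, pattern, i_match, i_submatch):
    """Return the requested submatch (zero-based indices), or "" if out of range."""
    matches = list(re.finditer(pattern, text))
    if i_match < len(matches):
        groups = matches[i_match].groups()
        if i_submatch < len(groups):
            return groups[i_submatch] or ""
    return ""

letter = regex_capture(text, r"([A-Z])\w*", 2, 0)  # pattern from the answer
word = regex_capture(text, r"([A-Z]\w*)", 2, 0)    # parentheses around the whole word
print(letter)  # T
print(word)    # The
```

Note that either pattern matches any capitalized word, not only the first word of a sentence; it happens to work on this sample because no other word is capitalized.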

Regex with Non-capturing Group
