Question

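I'm using regular expressions in Python to match numbers in a str. I want to capture numbers that may have thousands separators (for me, a comma or a space) or that may be just a plain run of digits. Below is what my regex captures: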

>>> import re
>>> test = '3,254,236,948,348.884423 cold things, ' + \
'123,242 falling birds, .84973 of a French pen , ' + \
'65 243 turtle gloves, 8 001 457.2328009 units, and ' + \
'8d523c.'
>>> matches = re.finditer(ANY_NUMBER_SRCH, test, flags=re.MULTILINE)
>>> for match in matches:
...   print (str(match))
...
<_sre.SRE_Match object; span=(0, 24), match='3,254,236,948,348.884423'>
<_sre.SRE_Match object; span=(27, 34), match='123,242'>
<_sre.SRE_Match object; span=(37, 43), match='.84973'>
<_sre.SRE_Match object; span=(46, 52), match='65 243'>
<_sre.SRE_Match object; span=(55, 72), match='8 001 457.2328009'>
<_sre.SRE_Match object; span=(73, 74), match='8'>
<_sre.SRE_Match object; span=(75, 78), match='523'>

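This is the matching behavior I want. Now I want to take each matched number and remove the thousands separators (',' or ' ') if there are any. That should leave me with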

'3254236948348.884423 cold things, ' + \
'123242 falling birds, .84973 of a French pen ,' + \
'65243 turtle gloves, 8001457.2328009 units, ' + \
'and 8d523c.'

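Basically, I have one regex that captures numbers. This regex is used in several places, such as finding dollar amounts, getting ordinal numbers, and so on. For that reason, I named the regex ANY_NUMBER_SRCH.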

What I'd like to do is something like the following:

matches = some_method_to_get_all_matches(ANY_NUMBER_SRCH)
for match in matches:
  corrected_match = re.sub(r"[, ]", "", match)
  change_match_to_corrected_match_in_the_test_string

In practice, I can't use substitution groups. If you just want to see the regex, you can look at https://regex101.com/r/AzChEE/3. Basically, part of my regex is as follows:

r"(?P<whole_number_w_thous_sep>(?P<first_group>\d{1,3})(?P<thousands_separator>[ ,])(?P<three_digits_w_sep>(?P<three_digits>\d{3})(?P=thousands_separator))*(?P<last_group_of_three>\d{3})(?!\d)"

Here it is again, broken up so that it doesn't need horizontal scrolling:

(r"(?P<whole_number_w_thous_sep>(?P<first_group>\d{1,3})"
  "(?P<thousands_separator>[ ,])"
  "(?P<three_digits_w_sep>(?P<three_digits>\d{3})"
  "(?P=thousands_separator))*"
  "(?P<last_group_of_three>\d{3})(?!\d)")

Because three_digits_w_sep is a repeated capturing group, the regex engine does not retain each of the * repetitions; only the last repetition is kept.
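To illustrate that point, here is a minimal sketch with a much simpler, made-up pattern (not ANY_NUMBER_SRCH). The group names `first`, `rest` and `three` are hypothetical and exist only for this example. It shows that the overall match covers every repetition, while the named group inside the repetition only reports the last one.

```python
import re

# Simplified, hypothetical pattern: 1-3 leading digits followed by one or
# more ",ddd" groups. The inner named group sits inside a repeated group.
pattern = re.compile(r"(?P<first>\d{1,3})(?P<rest>,(?P<three>\d{3}))+")

m = pattern.search("total: 3,254,236 items")

whole = m.group(0)             # the full match spans all repetitions
last_three = m.group("three")  # only the final repetition is retained

print(whole)       # 3,254,236
print(last_three)  # 236
```

So the individual groups cannot be used to rebuild the number, but the full match text (`m.group(0)`) still contains everything that is needed.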

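I'm sure there is a way to use the span part of the _sre.SRE_Match object. However, that would get quite involved, and I am working with strings that are many thousands of characters long. Is there a simple way to do a re.sub after re.finditer or re.match, or after whichever method is used to find the number pattern?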
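For reference, the span-based approach alluded to above might look roughly like the sketch below. `NUMBER` is a simplified stand-in for ANY_NUMBER_SRCH and `strip_separators_by_span` is a hypothetical helper name; both are assumptions made for this example only. The string is rebuilt in a single pass by copying the text between matches unchanged and cleaning only the matched pieces.

```python
import re

# Hypothetical stand-in for ANY_NUMBER_SRCH: digit groups joined by a
# consistent "," or " " separator (the backreference \1 enforces that).
NUMBER = re.compile(r"\d{1,3}([ ,])\d{3}(?:\1\d{3})*(?!\d)")

def strip_separators_by_span(text):
    """Rebuild text, removing separators inside each matched number."""
    pieces = []
    last_end = 0
    for match in NUMBER.finditer(text):
        start, end = match.span()
        pieces.append(text[last_end:start])                # untouched text
        pieces.append(re.sub(r"[ ,]", "", match.group()))  # cleaned number
        last_end = end
    pieces.append(text[last_end:])                         # trailing text
    return "".join(pieces)

result = strip_separators_by_span("65 243 turtle gloves, 123,242 birds")
print(result)  # 65243 turtle gloves, 123242 birds
```

Collecting the pieces in a list and joining once avoids repeatedly copying a long string, which matters for inputs of many thousands of characters.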

@abarnert gave me the right answer, using a lambda function. My comment under @abarnert's answer, the one starting with "Verified!", shows all the steps.

My attempt

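By the way, I have already looked at these questions on SO (replace portion of match, extract part of a match, replace after matching pattern, repeated capturing group stuff), but they only show how to use substitution groups. I also tried using re.finditer, as shown below, with the following result.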

>>> matches = re.finditer(lib_re.ANY_NUMBER_SRCH, test, flags=re.MULTILINE)
>>> for match in matches:
...   print ("match: " + str(match))
...   corrected_match = re.sub(r"[, ]", "", match)
...   print ("corrected_match: " + str(corrected_match))
...
match: <_sre.SRE_Match object; span=(0, 24), match='3,254,236,948,348.884423'>
Traceback (most recent call last):
  File "<stdin>", line 3, in <module>
  File "/usr/lib/python3.6/re.py", line 191, in sub
    return _compile(pattern, flags).sub(repl, string, count)
TypeError: expected string or bytes-like object

Big regex

If something happens to the regex101.com link, this is the huge regex:

Answer 1

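I don't see any reason you can't just use re.sub instead of re.finditer here. Your repl gets applied once for each match, and what comes back is the result of replacing each match of pattern in string with repl, which is exactly what you want.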

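I can't actually run your example, because copying and pasting test gives me a SyntaxError, and copying and pasting ANY_NUMBER_SRCH gives me an error compiling the regex, and I don't want to go down the rabbit hole of fixing all the errors, most of which probably aren't even in your real code. So let me give a simpler example: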

>>> import re
>>> test = '3,254,236,948,348.884423 cold things and 8d523c'
>>> pattern = re.compile(r'[\d,]+')
>>> pattern.findall(test) # just to verify that it works
['3,254,236,948,348', '884423', '8', '523']
>>> pattern.sub(lambda match: match.group().replace(',', ''), test)
'3254236948348.884423 cold things and 8d523c'

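Obviously your repl function will be a bit more complicated than just removing all of the commas, and you'll probably want to def it out-of-line rather than trying to cram it into a lambda. But whatever your rule is, if you write it as a function that takes a match object and returns the string you want in place of that match, you can just pass that function to sub.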
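As a rough sketch of that out-of-line version: the pattern below, `NUMBER_WITH_SEPARATORS`, is a simplified stand-in I made up for this example, not your ANY_NUMBER_SRCH, and `remove_thousands_separators` is a hypothetical name. The only assumption about your real regex is that the full match (group 0) covers the whole number, separators included.

```python
import re

# Hypothetical, simplified stand-in for ANY_NUMBER_SRCH. Group 1 captures
# the separator, and \1 requires the same separator between later groups.
NUMBER_WITH_SEPARATORS = re.compile(r"\d{1,3}([ ,])\d{3}(?:\1\d{3})*(?!\d)")

def remove_thousands_separators(match):
    """Return the matched number with its ',' or ' ' separators removed."""
    separator = match.group(1)
    return match.group(0).replace(separator, "")

test = ("3,254,236,948,348.884423 cold things, "
        "123,242 falling birds, .84973 of a French pen , "
        "65 243 turtle gloves, 8 001 457.2328009 units, and "
        "8d523c.")

# sub() calls the function once per match and splices the results back in.
cleaned = NUMBER_WITH_SEPARATORS.sub(remove_thousands_separators, test)
print(cleaned)
```

Because the function works on `match.group(0)` rather than on the individual named groups, it does not matter that a repeated capturing group only keeps its last repetition.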
