Regex matching a large pattern

Time: 2015-12-31 21:28:53

Tags: python regex

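I am new to regex and I am trying to match some patterns. It works for short patterns, but it gets stuck on large ones (it looks like a catastrophic backtracking problem).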

Below is my string:

world0 world1 world2 world3 world4 world5 world6 world7 world8 world9 world10
world11 world12 world13 world14 world15 world16 world17 world18 world19 world20
world21 world22 world23 world24 world25 world26 world27 world28 world29 world30
world31 world32 world33 world34 world35 world36 world37 world38 world39 world40
world41 world42 world43 world44 world45 world46 world47 world48 world49 world50
world51 world52 world53 world54 world55 world56 world57 world58 world59 world60
world61 world62 world63 world64 world65 world66 world67 world68
world69 world70 world71 world72 world73 world74 world75 world76 world77 world78
world79 world80 world81 world82 world83 world84 world85 world86 world87 world88
world89 world90 world91 world92 world93 world94 world95 world96 world97 world98
world99 world0 world1 world2 world3 world4 world5 world6 world7 world8 world9
world10 world11 world12 world13 world14 world15 world16 world17 world18 world19
world20 world21 world22 world23 world24 world25 world26 world27 world28 world29
world30 world31 world32 world33 world34 world35 world36 world37 world38 world39
world40 world41 world42 world43 world44 world45 world46 world47 world48 world49
world50 world51 world52 world53 world54 world55 world56 world57 world58 world59
world60 world61 world62 world63 world64 world65 world66 world67 world68 world69
world70 world71 world72 world73 world74 world75 world76 world77 world78 world79
world80 world81 world82 world83 world84 world85 world86 world87 world88 world89
world90 world91 world92 world93 world94 world95 world96 world97 world98 

My match pattern is a list of strings, say match_list, and the expected output is every substring of the text above that contains all the strings defined in match_list.

Small list = ["world0","world1", "world2"]

I tried the following pattern:

(?=((\b(?:world0|world1|world2)\b[\w\s]*?){3}))

This one works fine, and the matches are the ones I expect:

[0-20]  `world0 world1 world2`

[7-796] `world1 world2 world3 world4 world5 world6 world7 world8 world9 world10
world11 world12 world13 world14 world15 world16 world17 world18 world19 world20
world21 world22 world23 world24 world25 world26 world27 world28 world29 world30
world31 world32 world33 world34 world35 world36 world37 world38 world39 world40
world41 world42 world43 world44 world45 world46 world47 world48 world49 world50
world51 world52 world53 world54 world55 world56 world57 world58 world59 world60
world61 world62 world63 world64 world65 world66 world67 world68 world69 world70
world71 world72 world73 world74 world75 world76 world77 world78 world79 world80
world81 world82 world83 world84 world85 world86 world87 world88 world89 
world90 world91 world92 world93 world94 world95 world96 world97 world98 world99 world0`

 [14-803] `world2 world3 world4 world5 world6 world7 world8 world9 world10 world11
 world12 world13 world14 world15 world16 world17 world18 world19 world20 world21
 world22 world23 world24 world25 world26 world27 world28 world29 world30 world31
 world32 world33 world34 world35 world36 world37 world38 world39 world40 world41
 world42 world43 world44 world45 world46 world47 world48 world49 world50 world51
 world52 world53 world54 world55 world56 world57 world58 world59 world60 world61
 world62 world63 world64 world65 world66 world67 world68 world69 world70 world71
 world72 world73 world74 world75 world76 world77 world78 world79 world80 world81
 world82 world83 world84 world85 world86 world87 world88 world89 world90 world91
 world92 world93 world94 world95 world96 world97 world98 world99 world0 world1`

[790-810]   `world0 world1 world2`

But for the large list = ['world0','world1','world2','world3','world4','world5','world6','world7','world8','world9','world10','world11','world12','world13','world14','world15','world16','world17','world18','world19','world20','world21','world22','world23','world24','world25','world26','world27','world28','world29','world30','world31','world32','world33','world34','world35','world36','world37','world38','world39','world40','world41','world42','world43','world44','world45','world46','world47','world48','world49']

I tried the following pattern:

(?=((\b(?:world0|world1|world2|world3|world4|world5|world6|world7|world8|world9|world10|world11|world12|world13|world14|world15|world16|world17|world18|world19|world20|world21|world22|world23|world24|world25|world26|world27|world28|world29|world30|world31|world32|world33|world34|world35|world36|world37|world38|world39|world40|world40|world41|world42|world43|world44|world45|world46|world47|world48|world49|world50)\b[\w\s]*?){49}))

This gives me a catastrophic backtracking error. Can someone tell me what I am doing wrong, or what the best approach is?

2 Answers:

Answer 0: (score: 1)

First, your pattern is wrong, because it also matches world0 world0 world0.
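A quick check with the standard re module seems to confirm this. The snippet below is only an illustration: the pattern is copied from the question, and the three-word test string is made up for the purpose.

```python
import re

# The question's small-list pattern, copied verbatim.
pattern = re.compile(r'(?=((\b(?:world0|world1|world2)\b[\w\s]*?){3}))')

# Three copies of the same word: world1 and world2 never appear.
m = pattern.search("world0 world0 world0")
print(m.group(1))  # world0 world0 world0
```

The quantifier {3} only counts how many list words were seen; nothing in the pattern requires them to be three different words.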

This problem can't be solved with a regex alone. Suppose I write a pattern (for the regex module) like:

import regex

# `text` is assumed to hold the sample string from the question
word_list = ['world0', 'world1', 'world2']
p = regex.compile(r'''
    \m (\L<words>)
    \W++ (?>\w+\W+)*? (?!\g{-1})
    (\L<words>) 
    \W++ (*SKIP) (?>\w+\W+)*? (?!\g{-1}|\g{-2})
    (\L<words>) \M 
  ''', regex.VERBOSE, words=word_list)

for m in p.finditer(text, overlapped=True):
    print(m.group(0))

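Searching for only three items in the sample text already gives something complicated (and not efficient, even with the optimizations), hard to extend to more items, and likely to blow up with more text or more items.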

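Another possible approach is to search only for the words in the list and, inside a generator, produce a text excerpt once all the words have been found: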

import regex
from collections import deque

data = '''He moved on as he spoke, and the Dormouse followed him: the March Hare moved into the Dormouse’s place, and Alice rather unwillingly took the place of the March Hare. The Hatter was the only one who got any advantage from the change: and Alice was a good deal worse off than before, as the March Hare had just upset the milk-jug into his plate.
Alice did not wish to offend the Dormouse again, so she began very cautiously: `But I don’t understand. Where did they draw the treacle from?’
`You can draw water out of a water-well,’ said the Hatter; `so I should think you could draw treacle out of a treacle-well–eh, stupid?’
`But they were IN the well,’ Alice said to the Dormouse, not choosing to notice this last remark.
`Of course they were’, said the Dormouse; `–well in.’
This answer so confused poor Alice, that she let the Dormouse go on for some time without interrupting it.
`They were learning to draw,’ the Dormouse went on, yawning and rubbing its eyes, for it was getting very sleepy; `and they drew all manner of things–everything that begins with an M–‘
`Why with an M?’ said Alice.
`Why not?’ said the March Hare.
Alice was silent.
The Dormouse had closed its eyes by this time, and was going off into a doze; but, on being pinched by the Hatter, it woke up again with a little shriek, and went on: `–that begins with an M, such as mouse-traps, and the moon, and memory, and muchness– you know you say things are “much of a muchness”–did you ever see such a thing as a drawing of a muchness?’
`Really, now you ask me,’ said Alice, very much confused, `I don’t think–‘
`Then you shouldn’t talk,’ said the Hatter.'''

word_list = ('Dormouse', 'Hatter', 'Alice')

def match_gen(word_list, text):
    p = regex.compile(r'\m\L<words>\M', words=word_list)
    d = deque()
    occlist = [0]*len(word_list)   

    for m in p.finditer(text):
        windex = word_list.index(m.group(0))
        d.append((windex, m.start()))
        occlist[windex] += 1

        while 0 not in occlist:
            elt = d.popleft()
            occlist[elt[0]] -= 1
            yield [elt[1],m.end()],text[elt[1]:m.end()]

for x in match_gen(word_list, data):
    print(x)

The advantages are that there is no risk of catastrophic backtracking and that memory use stays small.

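Note: I chose the regex module rather than the re module because it has more convenient features, such as named lists, the overlapped flag, and the word boundaries \m and \M. You can do the same thing with the re module, but you need (?=(...)) for overlapping matches, \b instead of \m and \M, and something like '|'.join(word_list) to build the alternation.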
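As a rough sketch of that re-based variant (not the answer's own code): the function name match_gen_re, the re.escape call, and the dict used for index lookup are assumptions added here for illustration. No lookahead is needed in this version, because the words themselves never overlap.

```python
import re
from collections import deque

def match_gen_re(word_list, text):
    # re has no \L<words> named lists, so the alternation is built by hand.
    alternation = '|'.join(map(re.escape, word_list))
    p = re.compile(r'\b(?:%s)\b' % alternation)
    index = {w: i for i, w in enumerate(word_list)}  # word -> position in list
    d = deque()                      # (word index, start offset) of pending words
    occlist = [0] * len(word_list)   # how often each word occurs in the window

    for m in p.finditer(text):
        windex = index[m.group(0)]
        d.append((windex, m.start()))
        occlist[windex] += 1

        # While every word is present, emit the window and drop its first word.
        while 0 not in occlist:
            first = d.popleft()
            occlist[first[0]] -= 1
            yield [first[1], m.end()], text[first[1]:m.end()]

sample = "world0 world1 world2 world3 world0 world1 world2"
for x in match_gen_re(["world0", "world1", "world2"], sample):
    print(x)
```

On this short sample the generator yields four windows, the first and last being world0 world1 world2, which mirrors the shape of the output the question expects.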

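Note 2: if your word list is really long, you can use the same approach without an alternation as the pattern (that is, without \L<words> or '|'.join(word_list)): use only \w+ and check whether each match is in the list. You can replace the beginning of the previous code like this: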

def match_gen(word_list, text):
    p = regex.compile(r'\w+')
    d = deque()
    occlist = [0]*len(word_list)   

    for m in filter(lambda x: x.group(0) in word_list, p.finditer(text)):
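Putting that replacement together with the rest of the generator, the whole function might look roughly like the sketch below. It uses the standard re module instead of regex, and it swaps the `in word_list` test for a dict lookup so that membership checks stay cheap with long lists; both choices, and the name match_gen_words, are assumptions rather than part of the answer.

```python
import re
from collections import deque

def match_gen_words(word_list, text):
    index = {w: i for i, w in enumerate(word_list)}  # O(1) membership test
    d = deque()
    occlist = [0] * len(word_list)

    # Scan every word and keep only those that belong to the list.
    for m in re.finditer(r'\w+', text):
        windex = index.get(m.group(0))
        if windex is None:
            continue
        d.append((windex, m.start()))
        occlist[windex] += 1

        while 0 not in occlist:
            first = d.popleft()
            occlist[first[0]] -= 1
            yield [first[1], m.end()], text[first[1]:m.end()]

for x in match_gen_words(("Hatter", "Alice"), "Alice met the Hatter; the Hatter met Alice."):
    print(x)
```

The pattern no longer grows with the list, so compile time and matching cost per word stay the same whether the list has three entries or three thousand.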

Answer 1: (score: 0)

Your last pattern seems intended to match every world whose numeric suffix is at most 50. So, instead of

(?=((\b(?:world0|world1|world2|world3|world4|world5|world6|world7|world8|world9|world10|world11|world12|world13|world14|world15|world16|world17|world18|world19|world20|world21|world22|world23|world24|world25|world26|world27|world28|world29|world30|world31|world32|world33|world34|world35|world36|world37|world38|world39|world40|world40|world41|world42|world43|world44|world45|world46|world47|world48|world49|world50)\b[\w\s]*?){49}))

why not use the following (it matches every value from 0 to 49, or 50):

(?=((\b(?:world([1-4]?[0-9]|50))\b[\w\s]*?){3}))

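This is my attempt at cleaning up the regex, based on your description:

"It should match the substrings above that contain all the strings defined in match_list"
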
import re

regex = r'\bworld(?:[1-4]?[0-9]|50)\b'
matches = re.findall(regex, "world1 world2 world50 world60")
print(matches)  # ['world1', 'world2', 'world50']
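Numeric-range alternations like this are easy to get subtly wrong, so it may be worth checking a candidate against every suffix before relying on it. The sketch below compares two spellings of "suffix between 0 and 50"; both patterns and the helper name matched_suffixes are illustrations, not code from the thread.

```python
import re

# Two candidate spellings of "suffix between 0 and 50".
first_digit_required = re.compile(r'world(?:[0-4][0-9]?|50)')
first_digit_optional = re.compile(r'world(?:[1-4]?[0-9]|50)')

def matched_suffixes(pattern, upper=60):
    """Return every n in 0..upper for which 'world<n>' matches completely."""
    return [n for n in range(upper + 1) if pattern.fullmatch('world%d' % n)]

expected = list(range(51))
missed = [n for n in expected if n not in matched_suffixes(first_digit_required)]
print(missed)                                              # [5, 6, 7, 8, 9]
print(matched_suffixes(first_digit_optional) == expected)  # True
```

A form that requires a leading digit in 0-4 skips the single-digit suffixes 5 to 9, while making the leading digit optional covers the whole range without admitting 51 and above.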