My regex in Python isn't recursing properly

Asked: 2009-06-05 09:21:27

Tags: python regex recursion

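I want to capture everything inside a bracketed tag plus the lines that follow it, but the match is supposed to stop the next time it reaches a bracketed tag. What am I doing wrong?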

import re #regex

regex = re.compile(r"""
         ^                    # Must start in a newline first
         \[\b(.*)\b\]         # Get what's enclosed in brackets 
         \n                   # only capture bracket if a newline is next
         (\b(?:.|\s)*(?!\[))  # should read: anyword that doesn't precede a bracket
       """, re.MULTILINE | re.VERBOSE)

haystack = """
[tab1]
this is captured
but this is suppose to be captured too!
@[this should be taken though as this is in the content]

[tab2]
help me
write a better RE
"""
m = regex.findall(haystack)
print(m)

What I want is:
[('tab1', 'this is captured\nbut this is suppose to be captured too!\n@[this should be taken though as this is in the content]\n', '[tab2]', 'help me\nwrite a better RE\n')]

Edit:

regex = re.compile(r"""
             ^           # Must start in a newline first
             \[(.*?)\]   # Get what's enclosed in brackets 
             \n          # only capture bracket if a newline is next
             ([^\[]*)    # stop reading at opening bracket
        """, re.MULTILINE | re.VERBOSE)

This seems to work, but it also cuts the content off at any bracket that appears inside it.
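
To make the truncation concrete, here is a minimal sketch that runs the edited pattern on a shortened input. The `haystack` string is an illustrative stand-in, not the original data.

```python
import re

# Same pattern as in the edit above.
regex = re.compile(r"""
    ^           # start of a line
    \[(.*?)\]   # tag name in brackets
    \n          # tag must be followed by a newline
    ([^\[]*)    # everything up to the next opening bracket of any kind
""", re.MULTILINE | re.VERBOSE)

# Shortened, hypothetical input with a bracket inside the content.
haystack = "[tab1]\nfirst line\n@[bracket in content]\n\n[tab2]\nhelp me\n"

matches = regex.findall(haystack)
print(matches)
# [('tab1', 'first line\n@'), ('tab2', 'help me\n')]
```

The second group stops at the `[` in `@[bracket in content]` because `[^\[]*` cannot tell a tag on its own line from a bracket in the middle of the content.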

3 answers:

Answer 0 (score: 3)

Python regular expressions don't support recursion, afaik.

Edit: but in your case this would work:

regex = re.compile(r"""
         ^           # Must start in a newline first
         \[(.*?)\]   # Get what's enclosed in brackets 
         \n          # only capture bracket if a newline is next
         ([^\[]*)    # stop reading at opening bracket
    """, re.MULTILINE | re.VERBOSE)

Edit 2: yes, it doesn't work properly. Try this instead:

import re

regex = re.compile(r"""
    (?:^|\n)\[             # tag's opening bracket  
        ([^\]\n]*)         # 1. text between brackets
    \]\n                   # tag's closing bracket
    (.*?)                  # 2. text between the tags
    (?=\n\[[^\]\n]*\]\n|$) # until tag or end of string but don't consume it
    """, re.DOTALL | re.VERBOSE)

haystack = """[tag1]
this is captured [not a tag[
but this is suppose to be captured too!
[another non-tag

[tag2]
help me
write a better RE[[[]
"""

print(regex.findall(haystack))

I do agree with viraptor, though. Regexes are cool, but you can't use them to check your file for errors. Maybe a hybrid? :P

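tag_re = re.compile(r'^\[([^\]\n]*)\]$', re.MULTILINE)
tags = list(tag_re.finditer(haystack))

result = {}
# Each section runs from the end of one tag line to the start of the next tag line.
for (mo1, mo2) in zip(tags[:-1], tags[1:]):
    result[mo1.group(1)] = haystack[mo1.end(1)+1:mo2.start(1)-1].strip()
if tags:  # the last section runs to the end of the string
    result[tags[-1].group(1)] = haystack[tags[-1].end(1)+1:].strip()

print(result)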

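Edit 3: that's because the ^ character means a negated match only inside [^squarebrackets]. Everywhere else it means start of string (or start of line with re.MULTILINE). There's no good way to do negative string matching in a regex, only negative character matching.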
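
The two meanings of `^` can be seen side by side in a short sketch. The `text` value is a made-up example.

```python
import re

text = "alpha\n[beta]\ngamma"  # hypothetical sample input

# Inside a character class, ^ negates: runs of characters that are
# neither '[' nor a newline.
not_bracket = re.findall(r"[^\[\n]+", text)

# Outside a class, ^ anchors at the start of the string...
start_only = re.findall(r"^\w+", text)

# ...or at the start of every line when re.MULTILINE is set.
each_line = re.findall(r"^\w+", text, re.MULTILINE)

print(not_bracket)  # ['alpha', 'beta]', 'gamma']
print(start_only)   # ['alpha']
print(each_line)    # ['alpha', 'gamma']
```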

Answer 1 (score: 3)

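First of all, why use a regex if you're trying to parse? As you can see, you can't find the source of the problem yourself, because a regex gives no feedback. Also, there is no recursion anywhere in that RE.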

Make your life simple:

def ini_parse(src):
    in_block = None   # name of the section currently being read
    contents = {}
    for line in src.split("\n"):
        if line.startswith('[') and line.endswith(']'):
            # A line that is entirely "[name]" opens a new section.
            in_block = line[1:-1]
            contents[in_block] = ""
        elif in_block is not None:
            contents[in_block] += line + "\n"
        elif line.strip() != "":
            # Non-blank text before the first section is an error.
            raise Exception("content out of block")
    return contents

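You get error handling through exceptions and, as a bonus, you can step through the execution in a debugger. You also get a dictionary as the result and can deal with duplicate sections while processing. My result: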

{'tab2': 'help me\nwrite a better RE\n\n',
 'tab1': 'this is captured\nbut this is suppose to be captured too!\n@[this should be taken though as this is in the content]\n\n'}
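
One possible way to deal with duplicate sections is sketched below. The name `ini_parse_strict` and the choice to raise `ValueError` are assumptions made for illustration, not part of the answer's code; merging or overwriting would be equally valid policies.

```python
def ini_parse_strict(src):
    """Hypothetical variant of ini_parse that rejects repeated section names."""
    in_block = None
    contents = {}
    for line in src.split("\n"):
        if line.startswith('[') and line.endswith(']'):
            in_block = line[1:-1]
            if in_block in contents:
                # Assumed policy: treat a repeated section as an error.
                raise ValueError("duplicate section: " + in_block)
            contents[in_block] = ""
        elif in_block is not None:
            contents[in_block] += line + "\n"
        elif line.strip() != "":
            raise ValueError("content out of block")
    return contents

sample = "[tab1]\nfirst\n\n[tab2]\nsecond\n"  # made-up input
parsed = ini_parse_strict(sample)
print(parsed)
```

Because the check lives in ordinary code, the error message can name the offending section, which a single regex match cannot report.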

REs are overused these days...

Answer 2 (score: 2)

Does this do what you want?

regex = re.compile(r"""
         ^                      # Must start in a newline first
         \[\b(.*)\b\]           # Get what's enclosed in brackets 
         \n                     # only capture bracket if a newline is next
         ([^[]*)
       """, re.MULTILINE | re.VERBOSE)

This gives a list of tuples (one 2-tuple per match). If you want a flat tuple, you can write:

m = sum(regex.findall(haystack), ())
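
A small sketch of what that flattening does, using a hand-written list in the same shape as the `findall` result (the `pairs` data is made up). `itertools.chain.from_iterable` is shown as an equivalent alternative that avoids building a new tuple at every addition.

```python
from itertools import chain

# Hypothetical data shaped like regex.findall(haystack): one 2-tuple per match.
pairs = [('tab1', 'first\n'), ('tab2', 'second\n')]

# With () as the start value, + concatenates the tuples one after another.
flat_sum = sum(pairs, ())

# Equivalent result without the repeated intermediate copies.
flat_chain = tuple(chain.from_iterable(pairs))

print(flat_sum)    # ('tab1', 'first\n', 'tab2', 'second\n')
print(flat_chain)  # ('tab1', 'first\n', 'tab2', 'second\n')
```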