Python regex: capture group captures/overwrites subsequent matches

Date: 2013-10-28 22:12:07

Tags: python regex

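In a regular expression, how can I match any number of any characters (e.g., (.|\n)*) without consuming other matches that might follow? In case that question is unclear, here is my situation:

In a text file, I have a bunch of emails, headers included, all pasted together.

Edit: The clean version below has each header at the start of a new line. My actual data may or may not. Each header component (like "From: xxx") may or may not be preceded by anything. In some cases, many emails and headers may all be on one line, after a bunch of other junk. Most importantly, I need to recognize the other email headers, which contain "From:". So I need to recognize the entire header pattern.

Several answers given before this edit relied on things like ^ or tab delimiting, which I can't count on. It looks like they might work with slight modification, but I'm (obviously) not very good with regexes and couldn't adapt them myself. I apologize for leaving this out earlier, and only a couple of answerers caught it... another product of my inexperience with regexes.

Here's the ugly version - this is the string I'm actually trying to match. It contains two headers and messages.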

emailsString = u"""From:\n     Lastname, Firstname\n     Sent:\n     Monday, June 24, 2013 1:48 PM\n     To:\n     Othername, Name\n     Subject:\n     RE: Center update\n    Message message message.\n    Such a lovely message\n    Take care,\n    Firstname Lastname, MS\n     Long signature\n     in this email\n   \n    E-mail:\n     email@email.com\n     Web\n     my blog\n     From:\n     Lastname, Firstname\n     Sent:\n     Monday, June 24, 2013 9:33 AM\n     To:\n     Othername, Name\n     Subject:\n     Center update\n     Importance:\n     High\n    Good Morning Name,\n    I hope this finds you doing well.\n    I wanted to inform you of some changes. The Center will be closing August 30\n     th\n     .  or September 1\n     st\n     .  I\u2019ve enjoyed my experience. """

Here is a cleaner version, to show what the headers contain:

From: Lastname, Firstname
Sent: Monday, July 15th, 2011, 9:36 AM
To: Othername, Name
Subject: blah
Importance: High

Message message message
second line of message

second para of message

From: Lastname, Firstname
Sent: Thursday, July 18th, 2011, 10:45 AM
To: Othername, Name
Subject: blahblah

message

...

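I'm trying to pull the information out of the headers along with the message itself. I have a regex that successfully matches all the headers, but I'm struggling with the message. The problem is that the message can contain anything (or nothing). There may be multiple newlines, and so on. I want to capture all of that, but I still want to split the emails apart. My attempt (note that the 'Importance' part of the header is optional):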

for hit in re.finditer(r'[\s\n]*From:[\s\n]*(?P<from>.*)[\s\n]*Sent:[\s\n]*(?P<date>.*)[\s\n]*To:[\s\n]*(?P<to>.*)[\s\n]*Subject:[\s\n]*(?P<subject>.*)[\s\n]*(?:Importance:)?[\s\n]*.*[\s\n]*(?P<message>(.|\n)*)', allEmailsString):
    print("from: " + hit.group("from"))
    print("to: " + hit.group("to"))
    print("date: " + hit.group("date"))
    print("subject: " + hit.group("subject"))
    print("message: " + hit.group("message"))

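The problem is that the message group grabs everything. I correctly get the from/to/etc. of the first email header, but the message then contains that email's message plus all of the following email headers and messages. I need to grab 'everything until the next email header/regex match or until the end of the string'.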
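To make the behaviour concrete, here is a minimal sketch on a simplified two-field header (the sample text and the group name sender are illustrative assumptions, not the real data). It contrasts the greedy (.|\n)* with a lazy quantifier followed by a lookahead for the next "From:" or the end of the string. A bare "From:" lookahead would also stop at a message body that happens to contain "From:", so real data would need a fuller header pattern inside the lookahead.

```python
import re

# Simplified, hypothetical sample: two emails with only From/Subject headers.
text = ("From: A\nSubject: one\nfirst body\n"
        "From: B\nSubject: two\nsecond body\n")

header = r'From:[ \t]*(?P<sender>[^\n]*)\nSubject:[ \t]*(?P<subject>[^\n]*)\n'

# Greedy: (?:.|\n)* runs to the end of the string and swallows the second email.
greedy = re.compile(header + r'(?P<message>(?:.|\n)*)')

# Lazy + lookahead: take as little as possible, stopping just before the next
# "From:" or, for the last email, at the end of the string (\Z).
lazy = re.compile(header + r'(?P<message>[\s\S]*?)(?=From:|\Z)')

greedy_hits = list(greedy.finditer(text))
lazy_hits = list(lazy.finditer(text))

print(len(greedy_hits))                          # one oversized match
print([m.group('message') for m in lazy_hits])   # one message per email
```

The lookahead is zero-width, so the next header is checked but not consumed, and finditer can start the following match right at it.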

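I already have a workaround: I can drop the message capture group and grab only the headers, then iterate over the match objects and slice the string based on where each match starts and ends. For example, message1 runs from match1.end to match2.start.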
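A minimal sketch of that workaround, assuming a simplified header with only From and Subject (the function name split_emails and the group name sender are hypothetical):

```python
import re

# Header-only pattern; the message is recovered by slicing between matches.
HEADER = re.compile(r'From:\s*(?P<sender>[^\n]*)\s*Subject:\s*(?P<subject>[^\n]*)')

def split_emails(text):
    """Return one dict per email: header fields plus the sliced message."""
    hits = list(HEADER.finditer(text))
    emails = []
    for i, hit in enumerate(hits):
        # The message ends where the next header starts, or at end of text.
        end = hits[i + 1].start() if i + 1 < len(hits) else len(text)
        record = hit.groupdict()
        record['message'] = text[hit.end():end].strip()
        emails.append(record)
    return emails

sample = ("From: A\nSubject: one\nfirst body\n"
          "From: B\nSubject: two\nsecond body\n")
for email in split_emails(sample):
    print(email)
```

This keeps the regex small, at the cost of a second pass over the match objects.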

So, I'm asking...

  • Can I do this with a capture group inside the regex?
  • Is there a better workaround?

3 answers:

Answer 0: (score: 1)

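Regexes can be used to extract chunks of text only when the text consists of variable parts and stable parts (or at least parts whose variability is itself stable...).

In the regex pattern below, I made a few assumptions about the 'stable' parts in order to increase their number. That makes it possible to tell the emails apart and extract the desired chunks from a text that seems to have almost no definite anchors:

  • I assume that the 'Sent' part always contains the name of a day of the week

  • I assume that if the 'Importance' line is present, a single word describes the importance, hence [^ \t\r\n]+

  • I assume that the subject cannot span several lines, hence [^\r\n]+

If a text has too few stable parts, that is, if its structure is too loose, using a regex becomes impossible.

The pattern [ \t\r\n]*(?P<from>.*?[^ \t\r\n])[ \t\r\n]* has a strip effect on the captured group. Consequently, if a message consists only of blank lines, the match reports the message as ''.

\Z is needed to catch the last email when no other lines follow the last message, as in my text examples.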
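The two remarks above can be checked in isolation. This is a small sketch using fragments of the pattern rather than the full regex; the sample strings and the names sender_re, with_end and without_end are assumptions made for illustration.

```python
import re

# The trailing [^ \t\r\n] forces the lazy group to end on a non-blank
# character, so surrounding whitespace never lands inside the capture.
sender_re = re.compile(
    r'From:[ \t\r\n]*(?P<sender>.*?[^ \t\r\n])[ \t\r\n]*Sent:', re.DOTALL)
sender = sender_re.search(
    'From:\n     Lastname, Firstname   \n     Sent:').group('sender')

tail = r'Subject:[ \t\r\n]*(?P<subject>[^\r\n]+)[ \t\r\n]*(?P<message>.*?)'
with_end = re.compile(tail + r'(?=[ \t\r\n]*From:|\Z)', re.DOTALL)
without_end = re.compile(tail + r'(?=[ \t\r\n]*From:)', re.DOTALL)

# A message made only of blank lines is captured as the empty string.
blank_only = with_end.search('Subject: once upon\n\n\n\nFrom: next')

# The last email has no following "From:", so only the \Z branch can match.
last_email = 'Subject: blobloblo\n\nNothing to say. '

print(repr(sender))
print(repr(blank_only.group('message')))
print(repr(with_end.search(last_email).group('message')))
print(without_end.search(last_email))
```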

import re


emailsString = (u'     From:\n'
                '     Lastname, Firstname\n'
                '     Sent:\n'
                '     Monday, June 24, 2013 1:48 PM\n'
                '     To:\n'
                '     Othername, Name\n'
                '     Subject:\n'
                '     RE: Center update\n'
                '    Message message message.\n'
                '    Such a lovely message\n'
                '    Take care,\n'
                '    Firstname Lastname, MS\n'
                '     Long signature\n'
                '     in this email\n'
                '   \n'
                '    E-mail:\n'
                '     email@email.com\n'
                '     Web\n'
                '     my blog\n'
                '     From:\n'
                '     Lastname, Firstname\n'
                '     Sent:\n'
                '     Monday, June 24, 2013 9:33 AM\n'
                '     To:\n'
                '     Othername, Name\n'
                '     Subject:\n'
                '     Center update\n'
                '     Importance:\n'
                '     High\n'
                '    Good Morning Name,\n'
                '    I hope this finds you doing well.\n'
                '    I wanted to inform you of some changes. The Center will be closing August 30\n'
                '     th\n'
                '     .  or September 1\n'
                '     st\n'
                '     .  I\u2019ve enjoyed my experience. ')


allEmailsString = '''
From: FirstLastname, FirstFirstname
Sent: Monday, July 15th, 2011, 9:36 AM
To: TheOne
Subject: blah
Importance: High

Message message message
second line of message

second para of message

From: MidLastname, MidFirstname
Sent: Thursday, July 18th, 2011, 10:45 AM
To: TWOTWO
Subject: once upon



From: LastLastname, LastFirstname
Sent: Saturday, July 20th, 2011, 12:51 AM
To: Mr Three
Subject: blobloblo

Nothing to say. '''



dispat = ("*  from: {from}\n"
          "*  to: {to}\n"
          "*  date: {date}\n"
          "*  subject: {subject}\n"
          "** message (beginning on next line):\n{message}\n"
          "-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-")



regx = re.compile('From:[ \t\r\n]*(?P<from>.*?[^ \t\r\n])'
                  '[ \t\r\n]*'
                  'Sent:[ \t\r\n]*'
                  '(?P<date>.*?(?:Mon|Tues|Wednes|Thurs|Fri|Satur|Sun)day.*?[^ \t\r\n])'
                  '[ \t\r\n]*'
                  'To:[ \t\r\n]*(?P<to>.*?[^ \t\r\n])'
                  '[ \t\r\n]*'
                  'Subject:[ \t\r\n]*(?P<subject>[^\r\n]+)'
                  '[ \t\r\n]*'
                  '(?:Importance:[ \t\r\n]*(?P<importance>[^ \t\r\n]+))?'
                  '[ \t\r\n]*'
                  '(?P<message>.*?)'
                  '(?=[ \t\r\n]*From:.*?'
                  'Sent:.*?(?:Mon|Tues|Wednes|Thurs|Fri|Satur|Sun)day.*?'
                  r'To.*?Subject:|\Z)',
                  re.DOTALL)


for s in (emailsString,allEmailsString):
    print(''.join(dispat.format(**d)
                  for d in (ma.groupdict('') for ma in regx.finditer(s))))
    print('\n#######################################\n')

Result

*  from: Lastname, Firstname
*  to: Othername, Name
*  date: Monday, June 24, 2013 1:48 PM
*  subject: RE: Center update
** message (beginning on next line):
Message message message.
    Such a lovely message
    Take care,
    Firstname Lastname, MS
     Long signature
     in this email

    E-mail:
     email@email.com
     Web
     my blog
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-*  from: Lastname, Firstname
*  to: Othername, Name
*  date: Monday, June 24, 2013 9:33 AM
*  subject: Center update
** message (beginning on next line):
Good Morning Name,
    I hope this finds you doing well.
    I wanted to inform you of some changes. The Center will be closing August 30
     th
     .  or September 1
     st
     .  I\u2019ve enjoyed my experience. 
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-

#######################################

*  from: FirstLastname, FirstFirstname
*  to: TheOne
*  date: Monday, July 15th, 2011, 9:36 AM
*  subject: blah
** message (beginning on next line):
Message message message
second line of message

second para of message
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-*  from: MidLastname, MidFirstname
*  to: TWOTWO
*  date: Thursday, July 18th, 2011, 10:45 AM
*  subject: once upon
** message (beginning on next line):

-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-*  from: LastLastname, LastFirstname
*  to: Mr Three
*  date: Saturday, July 20th, 2011, 12:51 AM
*  subject: blobloblo
** message (beginning on next line):
Nothing to say. 
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-

#######################################

Answer 1: (score: 0)

I'd just divide (split) and conquer (re.match):

import re

# `data` is your text file
delimiter = r'(^|\n)From:'
capturer = re.compile(r'From:[\n\s]*(?P<from>.*)[\n\s]*'
                      r'Sent:[\n\s]*(?P<date>.*)[\n\s]*'
                      r'To:[\n\s]*(?P<to>.*)[\n\s]*'
                      r'Subject:[\n\s]*(?P<subject>.*)[\n\s]*'
                      r'(?:Importance:)?[\n\s]*.*[\n\s]*'
                      r'(?P<message>(\n|.)*)')

raw_emails = ['From:' + d for d in re.split(delimiter, data) if d.strip()]
emails = []
for raw_email in raw_emails:
    parts = capturer.match(raw_email)
    emails.append(parts.groupdict())

For your sample data, this outputs:

[{'date': 'Monday, July 15th, 2011, 9:36 AM',
  'from': 'Lastname, Firstname',
  'message': 'Message message message\nsecond line of message\n\nsecond para of message\n',
  'subject': 'blah',
  'to': 'Othername, Name'},
 {'date': 'Thursday, July 18th, 2011, 10:45 AM',
  'from': 'Lastname, Firstname',
  'message': '...\n',
  'subject': 'blahblah',
  'to': 'Othername, Name'}]
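One detail of the split step is worth spelling out: when the pattern passed to re.split contains a capturing group, the captured text is returned as well, which is why the list comprehension filters on d.strip(). A short sketch on made-up data (the variable names are illustrative):

```python
import re

data = "From: A\nbody one\nFrom: B\nbody two\n"

# Capturing group: the captured '' and '\n' separators appear in the result.
with_group = re.split(r'(^|\n)From:', data)

# Non-capturing group: only the pieces between delimiters are returned.
without_group = re.split(r'(?:^|\n)From:', data)

print(with_group)
print(without_group)

# Either way, blank pieces are dropped before "From:" is glued back on.
raw_emails = ['From:' + d for d in without_group if d.strip()]
print(raw_emails)
```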

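Answer 2: (score: 0)

This may look painful, but it is expanded for clarity. It uses multi-line mode and no dot-all.

@mobabo - Edited to this after the first comment.

Your keywords have to be clearly delimited, and they are. Your statement "I can't count on things like '^From' to work" shows that you did not look at the previous regex, which was the same in this respect. ^[^\S\n]*From: is not the same as ^From: the former allows any horizontal whitespace between the start of the line and the keyword.

Also, there is no clear delimiter between the subject and the message, or between the importance and the message. If 'Importance' is part of the email, then the subject has an end point.

I made a regex that handles the messy emails; at the bottom is a Perl program that exercises it. The output is included. See if it solves your problem (see below).

Unfortunately, this is about the best you can hope for.

Good luck! (Note: if Python had regex recursion, this regex would be a quarter of this size.)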

 # Compressed
 # -------------------
 #  ^[^\S\n]*From:\s*(?P<from>(?:(?!\s*^[^\S\n]*(?:From|Sent|To|Subject|Importance):)[\S\s])*)(?:\s*^[^\S\n]*Sent:\s*(?P<sent>(?:(?!\s*^[^\S\n]*(?:From|Sent|To|Subject|Importance):)[\S\s])*))?(?:\s*^[^\S\n]*To:\s*(?P<to>(?:(?!\s*^[^\S\n]*(?:From|Sent|To|Subject|Importance):)[\S\s])*))?(?:\s*^[^\S\n]*Subject:\s*(?P<subject>(?:(?!\s*^[^\S\n]*(?:(?:From|Sent|To|Subject|Importance)):)[\S\s])*)(?:\s*^[^\S\n]*Importance:\s*(?P<importance>(?:(?!\s*^[^\S\n]*(?:From|Sent|To|Subject|Importance):)[\S\s])*))?)?

 # Expanded
 # -------------------
 #

 ^ [^\S\n]* From: \s* 
 (?P<from>
      (?:
           (?!
                \s* ^ [^\S\n]* 
                (?: From | Sent | To | Subject | Importance )
                :
           )
           [\S\s] 
      )*
 )

 (?:
      \s* ^ [^\S\n]* Sent: \s* 
      (?P<sent>
           (?:
                (?!
                     \s* ^ [^\S\n]* 
                     (?: From | Sent | To | Subject | Importance )
                     :
                )
                [\S\s] 
           )*
      )
 )?

 (?:
      \s* ^ [^\S\n]* To: \s* 
      (?P<to>
           (?:
                (?!
                     \s* ^ [^\S\n]* 
                     (?: From | Sent | To | Subject | Importance )
                     :
                )
                [\S\s] 
           )*
      )
 )?

 (?:
      \s* ^ [^\S\n]* Subject: \s* 
      (?P<subject>
           (?:
                (?!
                     \s* ^ [^\S\n]* 
                     (?:
                          (?: From | Sent | To | Subject | Importance )
                     )
                     :
                )
                [\S\s] 
           )*
      )

      (?:
           \s* ^ [^\S\n]* Importance: \s* 
           (?P<importance>
                (?:
                     (?!
                          \s* ^ [^\S\n]* 
                          (?: From | Sent | To | Subject | Importance )
                          :
                     )
                     [\S\s] 
                )*
           )
      )?
 )?
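For Python readers, here is a rough sketch of the same pattern assembled with the re module. Since re has no recursion or subroutine calls, the repeated sub-pattern is built once as a string and concatenated; the helper names STOP, BODY, field and parse_emails, and the sample text, are assumptions made for this illustration rather than part of the answer.

```python
import re

# Negative lookahead: "the next thing is not a header keyword at line start".
STOP = r'(?!\s*^[^\S\n]*(?:From|Sent|To|Subject|Importance):)'
# Tempered token: consume any character while no header keyword follows.
BODY = r'(?:' + STOP + r'[\S\s])*'

def field(keyword, group):
    """Sub-pattern for one 'Keyword: value' header captured as `group`."""
    return r'\s*^[^\S\n]*' + keyword + r':\s*(?P<' + group + r'>' + BODY + r')'

EMAIL = re.compile(
    r'^[^\S\n]*From:\s*(?P<from>' + BODY + r')'
    + r'(?:' + field('Sent', 'sent') + r')?'
    + r'(?:' + field('To', 'to') + r')?'
    + r'(?:' + field('Subject', 'subject')
    + r'(?:' + field('Importance', 'importance') + r')?)?',
    re.MULTILINE,  # multi-line mode, no DOTALL, as in the answer
)

def parse_emails(text):
    """Return a list of dicts, one per header block found in text."""
    return [m.groupdict() for m in EMAIL.finditer(text)]

sample = (
    "From: A, B\nSent: Monday, July 15th, 2011, 9:36 AM\nTo: C, D\n"
    "Subject: blah\nImportance: High\n\nMessage one\n\n"
    "From: E, F\nSent: Thursday, July 18th, 2011, 10:45 AM\nTo: G, H\n"
    "Subject: blahblah\n\nmessage two\n"
)
emails = parse_emails(sample)
for email in emails:
    print(email)
```

As in the Perl output, the message text ends up attached to the last header field that matched (importance if present, otherwise subject), because nothing in the data marks where that header ends and the message begins.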


 # // Output from Perl sample code (below)
 # //
 # // ======================
 # // From:
 # //         Lastname, Firstname
 # // Sent:
 # //         Monday, July 15th, 2011, 9:36 AM
 # // To:
 # //         Othername, Name
 # // Subject:
 # //         blah
 # // Importance/Message:
 # //         High
 # // 
 # // Message message message
 # // second line of message
 # // 
 # // second para of message
 # // 
 # // 
 # // ======================
 # // From:
 # //         Lastname, Firstname
 # // Sent:
 # //         Thursday, July 18th, 2011, 10:45 AM
 # // To:
 # //         Othername, Name
 # // Subject/Message:
 # //         blahblah
 # // 
 # // message
 # // 
 # // 
 # // ======================
 # // From:
 # //         Lastname, Firstname
 # // Sent:
 # //         Monday, June 24, 2013 1:48 PM
 # // To:
 # //         Othername, Name
 # // Subject/Message:
 # //         RE: Center update
 # //     Message message message.
 # //     Such a lovely message
 # //     Take care,
 # //     Firstname Lastname, MS
 # //      Long signature
 # //      in this email
 # // 
 # //     E-mail:
 # //      email@email.com
 # //      Web
 # //      my blog
 # // 
 # // 
 # // ======================
 # // From:
 # //         Lastname, Firstname
 # // Sent:
 # //         Monday, June 24, 2013 9:33 AM
 # // To:
 # //         Othername, Name
 # // Subject:
 # //         Center update
 # // Importance/Message:
 # //         High
 # //     Good Morning Name,
 # //     I hope this finds you doing well.
 # //     I wanted to inform you of some changes. The Center will be closing August 30
 # // 
 # //      th
 # //      .  or September 1
 # //      st
 # //      .  I've enjoyed my experience.
 # // 

 # ------------------------------------------------------------
 # # Perl sample code
 # use strict;
 # use warnings;
 # 
 # $/ = undef;
 # 
 # my $str = <DATA>;
 # 
 # 
 # 
 # while ( $str =~ /
 #     ^[^\S\n]*From:\s*(?P<from>(?:(?!\s*^[^\S\n]*(?:From|Sent|To|Subject|Importance):)[\S\s])*)(?:\s*^[^\S\n]*Sent:\s*(?P<sent>(?:(?!\s*^[^\S\n]*(?:From|Sent|To|Subject|Importance):)[\S\s])*))?(?:\s*^[^\S\n]*To:\s*(?P<to>(?:(?!\s*^[^\S\n]*(?:From|Sent|To|Subject|Importance):)[\S\s])*))?(?:\s*^[^\S\n]*Subject:\s*(?P<subject>(?:(?!\s*^[^\S\n]*(?:(?:From|Sent|To|Subject|Importance)):)[\S\s])*)(?:\s*^[^\S\n]*Importance:\s*(?P<importance>(?:(?!\s*^[^\S\n]*(?:From|Sent|To|Subject|Importance):)[\S\s])*))?)?
 # /xmg)
 # 
 # {
 #  print "\n\n======================\n";
 #  print "From: \n\t$+{from}\n";
 #  if (defined $+{sent})
 #  {
 #      print "Sent: \n\t$+{sent}\n";
 #  }
 #  if (defined $+{to})
 #  {
 #      print "To: \n\t$+{to}\n";
 #  }
 #  if (defined $+{importance})
 #  {
 #      print "Subject: \n\t$+{subject}\n";
 #      print "Importance/Message: \n\t$+{importance}\n";
 #  }
 #  elsif (defined $+{subject})
 #  {
 #      print "Subject/Message: \n\t$+{subject}\n";
 #  }
 # }
 # 
 # 
 # __DATA__
 # 
 # From: Lastname, Firstname
 # Sent: Monday, July 15th, 2011, 9:36 AM
 # To: Othername, Name
 # Subject: blah
 # Importance: High
 # 
 # Message message message
 # second line of message
 # 
 # second para of message
 # 
 # From: Lastname, Firstname
 # Sent: Thursday, July 18th, 2011, 10:45 AM
 # To: Othername, Name
 # Subject: blahblah
 # 
 # message
 # 
 # 
 # 
 # 
 # 
 # From:
 #      Lastname, Firstname
 #      Sent:
 #      Monday, June 24, 2013 1:48 PM
 #      To:
 #      Othername, Name
 #      Subject:
 #      RE: Center update
 #     Message message message.
 #     Such a lovely message
 #     Take care,
 #     Firstname Lastname, MS
 #      Long signature
 #      in this email
 #    
 #     E-mail:
 #      email@email.com
 #      Web
 #      my blog
 #      From:
 #      Lastname, Firstname
 #      Sent:
 #      Monday, June 24, 2013 9:33 AM
 #      To:
 #      Othername, Name
 #      Subject:
 #      Center update
 #      Importance:
 #      High
 #     Good Morning Name,
 #     I hope this finds you doing well.
 #     I wanted to inform you of some changes. The Center will be closing August 30
 #      th
 #      .  or September 1
 #      st
 #      .  I've enjoyed my experience.
 # 
 # 