Regex does not match the pattern

Time: 2019-05-22 09:53:00

Tags: python regex python-3.x google-colaboratory

I am trying to create a regex for the following data:

12/07/16, 2:18 AM - ABC1: Anyway... this is ... abc: !?

:) Yea, this is next line - Multi line statements
12/07/16, 2:19 AM - User27: John, Bob, Him, I, May,2 ,3 100... multiple values
10/07/16, 2:41 PM - ABC1: Singe line statements
10/07/16, 2:41 PM - ABC1: Good
10/07/16, 2:45 PM - ABC1: Emojis statements, multiline, different languages



My regex:

(\d{1,2}\/\d{2}/\d{2})\,\s(\d{1,2}\:\d{2}\s\w{2})\s\-\s

The regex above works correctly up to

12/07/16, 2:18 AM - 

I then tried to handle the last part (the username and the message):

(\d{1,2}\/\d{2}/\d{2})\,\s(\d{1,2}\:\d{2}\s\w{2})\s\-\s(^[A-Z][0-9]$)

This fails to capture either the username or the message.
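A minimal reproduction of the behavior described above, assuming Python's standard `re` module and the first line of the sample data. The variable names `prefix` and `attempt` are only illustrative. With the default flags, `^` can only match at the very start of the string, so it can never succeed after the date/time prefix has already been consumed; `[A-Z][0-9]` would also match exactly two characters.

```python
import re

line = "12/07/16, 2:18 AM - ABC1: Anyway... this is ... abc: !?"

# The prefix pattern from the question: date, time and the " - " separator.
prefix = re.compile(r"(\d{1,2}\/\d{2}/\d{2})\,\s(\d{1,2}\:\d{2}\s\w{2})\s\-\s")

# The extended attempt from the question, with the anchored group at the end.
attempt = re.compile(
    r"(\d{1,2}\/\d{2}/\d{2})\,\s(\d{1,2}\:\d{2}\s\w{2})\s\-\s(^[A-Z][0-9]$)"
)

prefix_match = prefix.match(line)
print(repr(prefix_match.group(0)))  # the date/time prefix is matched
print(attempt.search(line))         # no match: "^" cannot match mid-string
```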

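I am struggling to write a regex for the message part, because it can contain newlines, spaces, emojis and text in different languages, and I do not know the length of either USERNAME or MESSAGE in advance.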

I am using Debugger to validate my regex, together with this cheatsheet.

I am open to any improvements and suggestions. Thanks!

2 Answers:

Answer 0: (score: 0)

This is a modification of your regex:

(?s)(\d{1,2}\/\d{2}/\d{2})\,\s(\d{1,2}\:\d{2}\s\w{2})\s\-\s(User\d+):\s*(.*?)(?=(\d{1,2}\/\d{2}/\d{2})\,\s(\d{1,2}\:\d{2}\s\w{2})\s\-\s|\Z)

Regex breakdown

(?s) # Dot also matches newlines
(\d{1,2}\/\d{2}/\d{2})\,\s(\d{1,2}\:\d{2}\s\w{2})\s\-\s # Same as your prefix pattern
(User\d+):\s* # Match the username followed by ':'
(.*?) # Match the message lazily until one of the conditions below holds
(?=
   (\d{1,2}\/\d{2}/\d{2})\,\s(\d{1,2}\:\d{2}\s\w{2})\s\-\s  # the next date/time prefix is found
   |
   \Z # or the end of the string is reached
)
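As a rough sketch of how a pattern of this shape might be applied in Python, the example below runs it over part of the sample data with `re.finditer`. Several things here are assumptions rather than part of the regex above: the names `PREFIX`, `pattern` and `records`, the named groups, and in particular the username group, which is swapped from `User\d+` to `[^:\n]+` because the sample also contains names such as `ABC1`.

```python
import re

# Sample taken from the question (the emoji lines are left out).
text = (
    "12/07/16, 2:18 AM - ABC1: Anyway... this is ... abc: !?\n"
    "\n"
    ":) Yea, this is next line - Multi line statements\n"
    "12/07/16, 2:19 AM - User27: John, Bob, Him, I, May,2 ,3 100... multiple values\n"
    "10/07/16, 2:41 PM - ABC1: Singe line statements\n"
    "10/07/16, 2:41 PM - ABC1: Good\n"
)

# Date/time prefix, reused inside the lookahead.
PREFIX = r"\d{1,2}/\d{2}/\d{2},\s\d{1,2}:\d{2}\s\w{2}\s-\s"

pattern = re.compile(
    r"(?P<date>\d{1,2}/\d{2}/\d{2}),\s"
    r"(?P<time>\d{1,2}:\d{2}\s\w{2})\s-\s"
    r"(?P<user>[^:\n]+):\s*"   # assumption: any name without ':' or a newline
    r"(?P<message>.*?)"        # lazy, so it stops at the first lookahead hit
    r"(?=" + PREFIX + r"|\Z)",
    re.DOTALL,                 # same effect as the inline (?s)
)

records = [m.groupdict() for m in pattern.finditer(text)]
for rec in records:
    rec["message"] = rec["message"].rstrip()  # drop the trailing newline(s)
    print(rec)
```

Because the lookahead does not consume the next prefix, `finditer` can start the following match exactly where the previous message ended.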

Edit: as noted in the comments, this approach requires the whole file to be held in memory in a single variable.
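A minimal sketch of what that could look like, assuming a UTF-8 text file. The helper name `parse_file` and its `encoding` parameter are hypothetical; the pattern is the one from this answer, unchanged, so it only picks up usernames of the form `User<digits>`.

```python
import re

# The pattern from this answer, unchanged.
PATTERN = re.compile(
    r"(?s)(\d{1,2}\/\d{2}/\d{2})\,\s(\d{1,2}\:\d{2}\s\w{2})\s\-\s(User\d+):\s*(.*?)"
    r"(?=(\d{1,2}\/\d{2}/\d{2})\,\s(\d{1,2}\:\d{2}\s\w{2})\s\-\s|\Z)"
)


def parse_file(path, encoding="utf-8"):
    """Read the whole file into one string and return (date, time, user, message) tuples."""
    with open(path, encoding=encoding) as f:
        text = f.read()  # the entire file is held in this single variable
    return [
        (m.group(1), m.group(2), m.group(3), m.group(4).rstrip())
        for m in PATTERN.finditer(text)
    ]
```

Calling `parse_file('path/to/file')` would then return one tuple per matched message. For very large files, memory use grows with the file size, since the lookahead needs to see the text that follows each message.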

Answer 1: (score: 0)

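You do not have to read the whole file into memory. You can read the file line by line and check whether each line starts with the date/time pattern. If a line does not start with that pattern, keep appending it to a temporary string. Once you reach the end of the file or another line that matches the date/time pattern, append the accumulated string to the result (or write it to another file, a data frame, etc.):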

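import re

values = []              # completed records
start_matching = False   # becomes True once the first date/time line has been seen
val = ""                 # record currently being accumulated
# A record starts with a date/time prefix such as "12/07/16, 2:18 AM - "
r = re.compile(r"\d{1,2}/\d{2}/\d{2},\s\d{1,2}:\d{2}\s\w{2}\s-\s")
with open('path/to/file', 'r') as f:
    for line in f:
        if r.match(line.strip()):
            start_matching = True
            if val:
                values.append(val.rstrip())  # strip trailing whitespace and store the finished record
                val = ""
            val += line
        else:
            if start_matching:
                val += line  # continuation line of the current record

if val:
    values.append(val.rstrip())  # strip trailing whitespace and store the last record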

If you then run

for v in values:
  print(v)
  print("-------")

the output will be

12/07/16, 2:18 AM - ABC1: Anyway... this is ... abc: !?

:) Yea, this is next line - Multi line statements
-------
12/07/16, 2:19 AM - User27: John, Bob, Him, I, May,2 ,3 100... multiple values
-------
10/07/16, 2:41 PM - ABC1: Singe line statements
-------
10/07/16, 2:41 PM - ABC1: Good
-------
10/07/16, 2:45 PM - ABC1: Emojis statements, multiline, different languages



-------
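Each collected record still holds the date, time, username and message in one string. A possible way to split them apart is sketched below. The names `RECORD` and `split_record` are hypothetical, and the sketch assumes the username is everything between the " - " separator and the first ':' on the same line.

```python
import re

# Same date/time prefix as above, plus groups for the username and the message.
# Assumption: the username contains neither ':' nor a newline.
RECORD = re.compile(
    r"(\d{1,2}/\d{2}/\d{2}),\s(\d{1,2}:\d{2}\s\w{2})\s-\s([^:\n]+):\s?(.*)",
    re.DOTALL,  # let the message group span several lines
)


def split_record(value):
    """Return (date, time, user, message) for one record, or None if it has another format."""
    m = RECORD.match(value)
    return m.groups() if m else None


print(split_record("10/07/16, 2:41 PM - ABC1: Good"))
```

Applied to the list built above, this could be used as `rows = [split_record(v) for v in values]`. Records without a `user:` part come back as `None` and can be filtered out or handled separately.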