Multiline regex pattern matching

Time: 2014-04-18 19:29:11

Tags: python regex

I get the following multiline(?) string from the output of a process.

  

04/18@14:22 - RESPONSE from 192.68.10.1 :
04/18@14:22 - RESPONSE from 192.68.10.1 :
TSB1 File Name: OCAP_TSB_76 04/18@14:22 - RESPONSE from 192.68.10.1 : TSB1 Duration: 1752 seconds 04/18@14:22 - RESPONSE from 192.68.10.1 : TSB1 Bit Rate: 3669 kbps 04/18@14:22 - RESPONSE from 192.68.10.1 :
04/18@14:22 - RESPONSE from 192.68.10.1 :
TSB2 File Name: OCAP_TSB_80 04/18@14:22 - RESPONSE from 192.68.10.1 : TSB2 Duration: 56 seconds 04/18@14:22 - RESPONSE from 192.68.10.1 : TSB2 Bit Rate: 3675 kbps 04/18@14:22 - RESPONSE from 192.68.10.1 :

I am trying to extract only the values in front of 'seconds' and 'kbps'.

This is what I have so far.

>>> cpat = re.compile(r"\.*RESPONSE from[^:]+:\s*TSB[\d] Duration:\s*(\d+) seconds\.*?RESPONSE from[^:]+:\s*TSB[\d] Bit Rate:\s*(\d+) kbps", re.DOTALL)
>>> m = re.findall(cpat,txt)
>>> m
[]

If I split the regex into separate parts, I do find matches. However, I would like to get a result like the following:

>>> m
[(1752, 3669), (56, 3675)]

Thanks a lot!

3 Answers:

Answer 0: (score: 3)

re.compile(r"\.*RESPONSE from[^:]+:\s*TSB[\d] Duration:\s*(\d+) seconds\.*?RESPONSE from[^:]+:\s*TSB[\d] Bit Rate:\s*(\d+) kbps", re.DOTALL)
                                                                       ^

I don't think this dot is meant to be escaped (otherwise it matches a literal dot instead of any character). Try using:

re.compile(r"\.*RESPONSE from[^:]+:\s*TSB[\d] Duration:\s*(\d+) seconds.*?RESPONSE from[^:]+:\s*TSB[\d] Bit Rate:\s*(\d+) kbps", re.DOTALL)

Also, your regex contains some unnecessary parts that you can remove and still get the matches you are looking for. I removed them in the regex below:

re.compile(r"RESPONSE from[^:]+:\s*TSB\d Duration:\s*(\d+) seconds.*?RESPONSE from[^:]+:\s*TSB\d Bit Rate:\s*(\d+) kbps", re.DOTALL)

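Namely:

  • The leading .* (written as \.* in your regex), which you don't need with re.findall because it scans the whole string anyway.
  • If \d is used on its own, there is no need to put it inside square brackets.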
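
To make the difference concrete, here is a minimal sketch comparing the escaped and unescaped dot. The sample text is an assumption modelled on the output in the question (one response per line), and the names txt, escaped, and unescaped are only for illustration:

```python
import re

# Sample text modelled on the question's output; the exact line breaks
# of the real process output are an assumption.
txt = (
    "04/18@14:22 - RESPONSE from 192.68.10.1 :\n"
    "TSB1 File Name: OCAP_TSB_76\n"
    "04/18@14:22 - RESPONSE from 192.68.10.1 : TSB1 Duration: 1752 seconds\n"
    "04/18@14:22 - RESPONSE from 192.68.10.1 : TSB1 Bit Rate: 3669 kbps\n"
    "04/18@14:22 - RESPONSE from 192.68.10.1 :\n"
    "TSB2 File Name: OCAP_TSB_80\n"
    "04/18@14:22 - RESPONSE from 192.68.10.1 : TSB2 Duration: 56 seconds\n"
    "04/18@14:22 - RESPONSE from 192.68.10.1 : TSB2 Bit Rate: 3675 kbps\n"
)

# "\.*?" only matches literal dots, so it cannot bridge the gap between
# "seconds" and the next "RESPONSE".
escaped = re.compile(
    r"RESPONSE from[^:]+:\s*TSB\d Duration:\s*(\d+) seconds\.*?"
    r"RESPONSE from[^:]+:\s*TSB\d Bit Rate:\s*(\d+) kbps",
    re.DOTALL,
)

# ".*?" with re.DOTALL lazily matches any characters, newlines included.
unescaped = re.compile(
    r"RESPONSE from[^:]+:\s*TSB\d Duration:\s*(\d+) seconds.*?"
    r"RESPONSE from[^:]+:\s*TSB\d Bit Rate:\s*(\d+) kbps",
    re.DOTALL,
)

no_match = escaped.findall(txt)   # []
pairs = unescaped.findall(txt)    # [('1752', '3669'), ('56', '3675')]
print(no_match)
print(pairs)
```

Note that findall returns the captured groups as strings, so they still need int() if numbers are wanted.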

Answer 1: (score: 2)

This code gives you what you want:

import re

data = '''
04/18@14:22 - RESPONSE from 192.68.10.1 :
04/18@14:22 - RESPONSE from 192.68.10.1 :
TSB1 File Name: OCAP_TSB_76 04/18@14:22 - RESPONSE from 192.68.10.1 : TSB1 Duration: 1752 seconds 04/18@14:22 - RESPONSE from 192.68.10.1 : TSB1 Bit Rate: 3669 kbps 04/18@14:22 - RESPONSE from 192.68.10.1 :
04/18@14:22 - RESPONSE from 192.68.10.1 :
TSB2 File Name: OCAP_TSB_80 04/18@14:22 - RESPONSE from 192.68.10.1 : TSB2 Duration: 56 seconds 04/18@14:22 - RESPONSE from 192.68.10.1 : TSB2 Bit Rate: 3675 kbps 04/18@14:22 - RESPONSE from 192.68.10.1 :
'''

output = []
block_pattern = re.compile(r'(\d+\/\d+@\d+:\d+ - RESPONSE.*?)(.*)')
seconds_speed_pattern = re.compile(r'TSB.*Duration:(.*)seconds.*TSB.*Bit Rate:(.*)kbps')
blocks = re.findall(block_pattern, data)
for block in blocks:
    ss_data = re.findall(seconds_speed_pattern, block[1])
    if ss_data:
        output.append(ss_data[0])

print(output)

This prints:

[(' 1752 ', ' 3669 '), (' 56 ', ' 3675 ')]

To convert these values from str to int, do the following:

output = [(int(a.strip()), int(b.strip())) for a, b in output]

This gives:

[(1752, 3669), (56, 3675)]
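
As a possible variation (not part of the code above), the capture groups could be narrowed to digits so that no stripping is needed afterwards. This is only a sketch: digits_pattern and the single sample line are assumptions made for illustration.

```python
import re

# One block of the sample data, kept on a single line as in the data above.
line = (
    "TSB1 File Name: OCAP_TSB_76 04/18@14:22 - RESPONSE from 192.68.10.1 : "
    "TSB1 Duration: 1752 seconds 04/18@14:22 - RESPONSE from 192.68.10.1 : "
    "TSB1 Bit Rate: 3669 kbps 04/18@14:22 - RESPONSE from 192.68.10.1 :"
)

# Hypothetical tightened pattern: \s* absorbs the padding, (\d+) captures
# only the number itself.
digits_pattern = re.compile(
    r"Duration:\s*(\d+)\s*seconds.*?Bit Rate:\s*(\d+)\s*kbps"
)

values = [(int(sec), int(rate)) for sec, rate in digits_pattern.findall(line)]
print(values)  # [(1752, 3669)]
```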

Answer 2: (score: 1)

result = re.findall(r"(?sim)Duration: (\d+).*?Rate: (\d+)", subject)


Options: dot matches newline; case insensitive; ^ and $ match at line breaks

Match the characters “Duration: ” literally «Duration: »
Match the regular expression below and capture its match into backreference number 1 «(\d+)»
   Match a single digit 0..9 «\d+»
      Between one and unlimited times, as many times as possible, giving back as needed (greedy) «+»
Match any single character «.*?»
   Between zero and unlimited times, as few times as possible, expanding as needed (lazy) «*?»
Match the characters “Rate: ” literally «Rate: »
Match the regular expression below and capture its match into backreference number 2 «(\d+)»
   Match a single digit 0..9 «\d+»
      Between one and unlimited times, as many times as possible, giving back as needed (greedy) «+»
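
For reference, a small runnable sketch of this pattern follows. The subject string is an assumed sample based on the question's data, and the second call simply spells out the inline (?sim) flags as re constants to show that both forms behave the same here.

```python
import re

# Assumed sample input, one response per line.
subject = (
    "04/18@14:22 - RESPONSE from 192.68.10.1 : TSB1 Duration: 1752 seconds\n"
    "04/18@14:22 - RESPONSE from 192.68.10.1 : TSB1 Bit Rate: 3669 kbps\n"
    "04/18@14:22 - RESPONSE from 192.68.10.1 : TSB2 Duration: 56 seconds\n"
    "04/18@14:22 - RESPONSE from 192.68.10.1 : TSB2 Bit Rate: 3675 kbps\n"
)

# Inline flags: s = DOTALL, i = IGNORECASE, m = MULTILINE.
inline = re.findall(r"(?sim)Duration: (\d+).*?Rate: (\d+)", subject)

# The same flags passed explicitly.
explicit = re.findall(
    r"Duration: (\d+).*?Rate: (\d+)",
    subject,
    re.DOTALL | re.IGNORECASE | re.MULTILINE,
)

# findall yields strings; convert them to the integer pairs asked for.
result = [(int(sec), int(rate)) for sec, rate in inline]
print(result)  # [(1752, 3669), (56, 3675)]
```

Only the s flag is essential for this input; the pattern has no ^ or $ anchors, so m has no effect, and i only matters if the casing of the labels varies.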