Question

I have a log file with the following structure that I need to parse in Python:

10.243.166.74, 10.243.166.74 - - [08/Feb/2017:16:33:26 +0100] "GET /script/header_footer.js?_=1486568008442 HTTP/1.1" 200 2143 "http://www.trendtron.com/popmenu/home" "Mozilla/5.0 (Windows NT 6.1; rv:31.0) Gecko/20100101 Firefox/31.0 K-Meleon/75.1"

This is my first time writing a regular expression, and this is what I have so far:

(.+?)\[(.*?)\] "(.*?)" (\d+) (\d+) "(.*?)" "(.*?)"

This pattern produces 7 strings, but I need more. Expected output:

"10.243.166.74, 10.243.166.74"
"08/Feb/2017"
"16:33:26"
"+0100"
"GET /script/header_footer.js?_=1486568008442"
"HTTP/1.1"
"200"
"2143"
"http://www.trendtron.com/popmenu/home"
"Mozilla/5.0"
"(Windows NT 6.1; rv:31.0)"
"Gecko/20100101"
"Firefox/31.0"
"K-Meleon/75.1"

Answer 1

Why not split the last group on spaces?

import re
log = '10.243.166.74, 10.243.166.74 - - [08/Feb/2017:16:33:26 +0100] "GET /script/header_footer.js?_=1486568008442 HTTP/1.1" 200 2143 "http://www.trendtron.com/popmenu/home" "Mozilla/5.0 (Windows NT 6.1; rv:31.0) Gecko/20100101 Firefox/31.0 K-Meleon/75.1"'

# Raw string so that \[ and \d reach the regex engine unchanged
regex = re.compile(r'(.+?)\[(.*?)\] "(.*?)" (\d+) (\d+) "(.*?)" "(.*?)"')
res = regex.match(log)
log_parts = list(res.groups())
# Take the last group (the user-agent string) and split it on spaces
devices_browsers_info_str = log_parts.pop(-1)
devices_browsers_info_parts = devices_browsers_info_str.split(' ')
log_parts.extend(devices_browsers_info_parts)

This gives us:

['10.243.166.74, 10.243.166.74 - - ', 
 '08/Feb/2017:16:33:26 +0100', 
 'GET /script/header_footer.js?_=1486568008442 HTTP/1.1', 
 '200', '2143', 'http://www.trendtron.com/popmenu/home',
 'Mozilla/5.0',
 '(Windows', 'NT', '6.1;', 'rv:31.0)',
 'Gecko/20100101', 
 'Firefox/31.0', 
 'K-Meleon/75.1']
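
Note that a plain split on spaces also breaks "(Windows NT 6.1; rv:31.0)" into four pieces, whereas the question wants it as one string. A minimal sketch of one way around that, assuming the user-agent string contains no nested parentheses; the helper name split_user_agent is hypothetical and not part of the answer above:

```python
import re

# Hypothetical helper: split a user-agent string on whitespace, but keep a
# parenthesised comment such as "(Windows NT 6.1; rv:31.0)" as one token.
# Assumption: parentheses are not nested.
def split_user_agent(user_agent):
    # Try "(...)" first; otherwise take a run of non-whitespace characters.
    return re.findall(r'\([^)]*\)|\S+', user_agent)

ua = 'Mozilla/5.0 (Windows NT 6.1; rv:31.0) Gecko/20100101 Firefox/31.0 K-Meleon/75.1'
ua_parts = split_user_agent(ua)
print(ua_parts)
```

In the code above, this would replace the call to devices_browsers_info_str.split(' '). The first two groups (the addresses with the trailing ' - - ' and the timestamp) would still need separate handling.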

Answer 2

(.+?)\- - \[(.+?)\:(.+?)\ (.+?)\] \"(.+?)\ (HTTP.+?)\" (.+?) (.+?) \"(.+?)\" \"(.+?) (.+?\)) (.+?)\ (.+?)\ (.+?)\"

Or see: http://regexr.com/3fndb
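
As a rough sketch of how the pattern above could be used from Python: the variable names below are illustrative assumptions, and the strip() call is an addition to drop the trailing space that the first group captures. The pattern expects exactly the user-agent shape in the sample line (one token, one parenthesised part, then three more tokens), so lines with a different shape will likely not match.

```python
import re

log = '10.243.166.74, 10.243.166.74 - - [08/Feb/2017:16:33:26 +0100] "GET /script/header_footer.js?_=1486568008442 HTTP/1.1" 200 2143 "http://www.trendtron.com/popmenu/home" "Mozilla/5.0 (Windows NT 6.1; rv:31.0) Gecko/20100101 Firefox/31.0 K-Meleon/75.1"'

# The pattern from this answer, written as a raw string.
pattern = re.compile(
    r'(.+?)\- - \[(.+?)\:(.+?)\ (.+?)\] \"(.+?)\ (HTTP.+?)\" (.+?) (.+?) '
    r'\"(.+?)\" \"(.+?) (.+?\)) (.+?)\ (.+?)\ (.+?)\"'
)

match = pattern.match(log)
# match is None when a line does not fit the pattern, so guard against that.
fields = [group.strip() for group in match.groups()] if match else []

for field in fields:
    print(field)
```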

Parsing a log file in Python