
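Asked: 2015-06-03 18:23:04

Tags: python parsing logging

I'm using Python logging to generate log files during processing, and I'm trying to read those log files into a list/dict, which will then be converted to JSON and loaded into a NoSQL database for processing.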


2015-05-22 16:46:46,985 - __main__ - INFO - Starting to Wait for Files
2015-05-22 16:46:56,645 - __main__ - INFO - Starting: Attempt 1 Checking for New Files from gs://folder/folder/
2015-05-22 16:47:46,488 - __main__ - INFO - Success: Downloading the Files from Cloud Storage: Return Code - 0 and FileCount 1
2015-05-22 16:48:48,180 - __main__ - ERROR - Failed: Waiting for files the Files from Cloud Storage: gs://folder/folder/
Traceback (most recent call last):
  File "<ipython-input-16-132cda1c011d>", line 10, in <module>
    if numFilesDownloaded == 0:
NameError: name 'numFilesDownloaded' is not defined
2015-05-22 16:49:17,918 - __main__ - INFO - Starting to Wait for Files
2015-05-22 16:49:32,160 - __main__ - INFO - Starting: Attempt 1 Checking for New Files from gs://folder/folder/
2015-05-22 16:49:39,329 - __main__ - INFO - Success: Downloading the Files from Cloud Storage: Return Code - 0 and FileCount 1
2015-05-22 16:53:30,706 - __main__ - INFO - Starting to Wait for Files

Note: there is actually a \n line break before each new date you see, but I can't seem to represent it here.


    'Date': '2015-05-22 16:46:46,985',
    'Type': 'INFO',
    'Message':'Starting to Wait for Files'

    'Date': '2015-05-22 16:48:48,180',
    'Type': 'ERROR',
    'Message':'Failed: Waiting for files the Files from Cloud Storage:  gs://folder/anotherfolder/ Traceback (most recent call last):
               File "<ipython-input-16-132cda1c011d>", line 10, in <module> if numFilesDownloaded == 0: NameError: name 'numFilesDownloaded' is not defined '





with open(filename,'r') as f:
    for key,group in it.groupby(f,lambda line: line.startswith('2015')):
        if key:
            for line in group:


logList = re.split(r'(19|20)\d\d[- /.](0[1-9]|1[012])[- /.](0[1-9]|[12][0-9]|3[01])', fileData)
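
One thing worth noting about the re.split attempt above: when the pattern contains capturing groups, re.split also returns the captured text as separate list items, so the pieces of the date end up as their own elements. Below is a minimal sketch of an alternative, assuming Python 3.7+ (where re.split can split on zero-width matches) and assuming every record starts at the beginning of a line with a YYYY-MM-DD date. The names RECORD_START and split_records are illustrative and not from the original post.

```python
import re

# Zero-width lookahead: matches the empty string right before a line that
# starts with a date such as "2015-05-22 ", without consuming any text.
RECORD_START = re.compile(r"(?m)^(?=\d{4}-\d{2}-\d{2} )")

def split_records(file_data):
    """Split raw log text into one string per record (traceback lines included)."""
    # Splitting at position 0 produces an empty first item, so drop empties.
    return [chunk for chunk in RECORD_START.split(file_data) if chunk]

sample = (
    "2015-05-22 16:48:48,180 - __main__ - ERROR - Failed\n"
    "Traceback (most recent call last):\n"
    "NameError: name 'numFilesDownloaded' is not defined\n"
    "2015-05-22 16:49:17,918 - __main__ - INFO - Starting to Wait for Files\n"
)
records = split_records(sample)
```

Because the lookahead consumes nothing, each chunk keeps its own timestamp, and traceback lines stay attached to the record that precedes them.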




6 answers:

Answer 0 (score: 7)

Using @Joran Beasley's answer, I came up with the following solution, and it seems to work:


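  • My log files ALWAYS follow the same structure: {Date} - {Type} - {Message}, so I used string slicing and splitting to break the items up the way I need them. For example, {Date} is always 23 characters and I only want the first 19.
  • Using line.startswith("2015") is crazy, as dates will eventually change, so I created a new function that uses a regex to match the date format I'm expecting. Once again, my log dates follow a specific pattern, so I could be specific.
  • The file is read by the first function, "generateDicts()", which then calls "matchDate()" to see IF the line being processed matches the {Date} format I'm looking for.
  • A NEW dict is created every time a valid {Date} format is found, and everything is added to it until the NEXT valid {Date} is encountered.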


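def generateDicts(log_fh):
    currentDict = {}
    for line in log_fh:
        if line.startswith(matchDate(line)):
            # A new record starts here: emit the previous one first.
            if currentDict:
                yield currentDict
            # Layout is "{Date} - {logger} - {Type} - {Message}"; keep only the
            # first 19 characters of the date (drops the ",985" milliseconds).
            date, _, level, text = line.split(" - ", 3)
            currentDict = {"date": date[:19], "type": level, "text": text}
        elif currentDict:
            # Continuation line (e.g. a traceback): append to the current record.
            currentDict["text"] += line
    if currentDict:
        yield currentDict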

with open("/Users/stevenlevey/Documents/out_folder/out_loyaltybox/log_CardsReport_20150522164636.logs") as f:
    listNew= list(generateDicts(f))


import re

def matchDate(line):
    # Return the leading timestamp if the line starts with one, else "NONE".
    matched = re.match(r'\d\d\d\d-\d\d-\d\d \d\d:\d\d:\d\d', line)
    if matched:
        # matches a date and stores it in matchThis
        matchThis = matched.group()
    else:
        matchThis = "NONE"
    return matchThis
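
Since the goal stated in the question is to turn these records into JSON for a NoSQL store, here is a minimal sketch of that last step. It assumes the records are dicts shaped like the ones generateDicts() yields; the sample record and the helper name to_json_lines are made up for illustration.

```python
import json

def to_json_lines(records):
    """Serialize each record dict as one JSON document per line."""
    return "\n".join(json.dumps(record, sort_keys=True) for record in records)

# Hypothetical record, shaped like the output of generateDicts().
sample_records = [
    {"date": "2015-05-22 16:46:46", "type": "INFO", "text": "Starting to Wait for Files\n"},
]
json_lines = to_json_lines(sample_records)
```

json.dumps escapes the newline inside the message, so each record stays on a single line even when the text contains a multi-line traceback.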

Answer 1 (score: 3)


def generateDicts(log_fh):
    currentDict = {}
    for line in log_fh:
        if line.startswith("2015"):  # you might want a better check here
            if currentDict:
                yield currentDict
            # Fields are separated by " - "; maxsplit keeps the message intact.
            date, _, level, text = line.split(" - ", 3)
            currentDict = {"date": date, "type": level, "text": text}
        else:
            # Continuation line (e.g. a traceback): append to the current record.
            currentDict["text"] += line
    yield currentDict

with open("logfile.txt") as f:
    print(list(generateDicts(f)))


Answer 2 (score: 2)


>>> import re
>>> date_re = re.compile(r'(?P<a_year>\d{2,4})-(?P<a_month>\d{2})-(?P<a_day>\d{2}) (?P<an_hour>\d{2}):(?P<a_minute>\d{2}):(?P<a_second>\d{2}[.\d]*)')
>>> found = date_re.match('2016-02-29 12:34:56.789')
>>> if found is not None:
...     print(found.groupdict())
...
{'a_year': '2016', 'a_month': '02', 'a_day': '29', 'an_hour': '12', 'a_minute': '34', 'a_second': '56.789'}
>>> found.groupdict()['a_month']
'02'
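
If the goal is a real datetime rather than separate string fields, the standard library can parse the default logging timestamp directly. This is a small sketch, assuming the timestamps look like the ones in the question (a comma followed by milliseconds); parse_asctime is an illustrative name.

```python
from datetime import datetime

def parse_asctime(raw):
    """Parse a timestamp such as '2015-05-22 16:46:46,985' into a datetime."""
    # %f accepts one to six digits, so '985' is read as 985000 microseconds.
    return datetime.strptime(raw, "%Y-%m-%d %H:%M:%S,%f")

stamp = parse_asctime("2015-05-22 16:46:46,985")
```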



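Answer 3 (score: 0)

The solution provided by @steven.levey is perfect. One addition I would make is to use this regex pattern both to determine whether the line is well-formed and to extract the required values. That way we don't have to split the line again after the regex has confirmed the format.

pattern = r'(^[0-9\-\s\:\,]+)\s-\s__main__\s-\s([A-Z]+)\s-\s([\s\S]+)'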
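
As a rough sketch of how that pattern could be used for both checks at once: a match means the line starts a new record and the three groups are the fields, while no match means the line is a continuation (for example, part of a traceback). The names LINE_RE and parse_line are illustrative, and the dict keys follow the earlier answers.

```python
import re

# Pattern from the answer above, written as a raw string.
LINE_RE = re.compile(r'(^[0-9\-\s\:\,]+)\s-\s__main__\s-\s([A-Z]+)\s-\s([\s\S]+)')

def parse_line(line):
    """Return a record dict for a well-formed line, or None for continuation lines."""
    matched = LINE_RE.match(line)
    if matched is None:
        return None
    date, level, text = matched.groups()
    return {"date": date, "type": level, "text": text}

record = parse_line("2015-05-22 16:46:46,985 - __main__ - INFO - Starting to Wait for Files\n")
continuation = parse_line("Traceback (most recent call last):\n")
```

Note that the pattern hard-codes the logger name __main__, so lines written by other loggers would be treated as continuation lines.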

Answer 4 (score: 0)

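records = []  # avoid shadowing the built-in name "list"
with open('bla.txt', 'r') as file:  # text mode, so lines are str, not bytes
    for line in file:
        parts = line.split(' - ')
        # Lines with fewer than four fields (e.g. traceback lines) are skipped.
        if len(parts) >= 4:
            # Note: parts[3] stops at the next ' - ' inside the message.
            records.append({'Date': parts[0], 'Type': parts[2], 'Message': parts[3]})
print(records)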


[{
    'Date': '2015-05-22 16:46:46,985',
    'Message': 'Starting to Wait for Files\n',
    'Type': 'INFO'
}, {
    'Date': '2015-05-22 16:46:56,645',
    'Message': 'Starting: Attempt 1 Checking for New Files from gs://folder/folder/\n',
    'Type': 'INFO'
}, {
    'Date': '2015-05-22 16:47:46,488',
    'Message': 'Success: Downloading the Files from Cloud Storage: Return Code',
    'Type': 'INFO'
}, {
    'Date': '2015-05-22 16:48:48,180',
    'Message': 'Failed: Waiting for files the Files from Cloud Storage: gs://folder/folder/\n',
    'Type': 'ERROR'
}, {
    'Date': '2015-05-22 16:49:17,918',
    'Message': 'Starting to Wait for Files\n',
    'Type': 'INFO'
}, {
    'Date': '2015-05-22 16:49:32,160',
    'Message': 'Starting: Attempt 1 Checking for New Files from gs://folder/folder/\n',
    'Type': 'INFO'
}, {
    'Date': '2015-05-22 16:49:39,329',
    'Message': 'Success: Downloading the Files from Cloud Storage: Return Code',
    'Type': 'INFO'
}, {
    'Date': '2015-05-22 16:53:30,706',
    'Message': 'Starting to Wait for Files',
    'Type': 'INFO'
}]
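
Note that the two "Success" messages in the output above are cut off at "Return Code", because the message itself contains ' - '. Here is a minimal sketch of one way around this, assuming the first three ' - ' separators always delimit the date, logger name, and level; split_record is an illustrative name.

```python
def split_record(line):
    """Split a log line into Date/Type/Message, keeping ' - ' inside the message."""
    # maxsplit=3 stops after the third separator, so the message stays whole.
    parts = line.rstrip("\n").split(" - ", 3)
    if len(parts) < 4:
        return None  # continuation line (e.g. part of a traceback)
    return {"Date": parts[0], "Type": parts[2], "Message": parts[3]}

record = split_record(
    "2015-05-22 16:47:46,488 - __main__ - INFO - Success: Downloading the Files "
    "from Cloud Storage: Return Code - 0 and FileCount 1\n"
)
```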

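Answer 5 (score: 0)

I recently had a similar task of parsing log records, but together with exception tracebacks for further analysis. Instead of banging my head against homegrown regexes, I used two wonderful libraries: parse for parsing the records (a very cool library, effectively the inverse of the stdlib's string.format) and boltons for parsing the tracebacks. Here is sample code extracted from my implementation, adapted to the logs in question: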

import datetime
import logging
import os
from pathlib import Path
from boltons.tbutils import ParsedException
from parse import parse, with_pattern

LOGGING_DEFAULT_DATEFMT = f"{logging.Formatter.default_time_format},%f"

# TODO better pattern
@with_pattern(r"\d\d\d\d-\d\d-\d\d \d\d:\d\d:\d\d,\d\d\d")
def parse_logging_time(raw):
    return datetime.datetime.strptime(raw, LOGGING_DEFAULT_DATEFMT)

def from_log(file: os.PathLike, fmt: str):
    chunk = ""
    custom_parsers = {"asctime": parse_logging_time}

    with Path(file).open() as fp:
        for line in fp:
            parsed = parse(fmt, line, custom_parsers)
            if parsed is not None:
                yield parsed
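            else:  # try parsing the stacktrace
                chunk += line
                try:
                    yield ParsedException.from_string(chunk)
                    chunk = ""
                except (IndexError, ValueError):
                    # Traceback is still incomplete; keep accumulating lines.
                    pass

if __name__ == "__main__":
    for parsed_record in from_log(
        file="example.log",  # placeholder path; the source omits the actual file name
        fmt="{asctime:asctime} - {module} - {levelname} - {message}",
    ):
        print(parsed_record)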


<Result () {'asctime': datetime.datetime(2015, 5, 22, 16, 46, 46, 985000), 'module': '__main__', 'levelname': 'INFO', 'message': 'Starting to Wait for Files\n'}>
<Result () {'asctime': datetime.datetime(2015, 5, 22, 16, 46, 56, 645000), 'module': '__main__', 'levelname': 'INFO', 'message': 'Starting: Attempt 1 Checking for New Files from gs://folder/folder/\n'}>
<Result () {'asctime': datetime.datetime(2015, 5, 22, 16, 47, 46, 488000), 'module': '__main__', 'levelname': 'INFO', 'message': 'Success: Downloading the Files from Cloud Storage: Return Code - 0 and FileCount 1\n'}>
<Result () {'asctime': datetime.datetime(2015, 5, 22, 16, 48, 48, 180000), 'module': '__main__', 'levelname': 'ERROR', 'message': 'Failed: Waiting for files the Files from Cloud Storage: gs://folder/folder/\n'}>
ParsedException('NameError', "name 'numFilesDownloaded' is not defined", frames=[{'filepath': '<ipython-input-16-132cda1c011d>', 'lineno': '10', 'funcname': '<module>', 'source_line': 'if numFilesDownloaded == 0:'}])
<Result () {'asctime': datetime.datetime(2015, 5, 22, 16, 49, 17, 918000), 'module': '__main__', 'levelname': 'INFO', 'message': 'Starting to Wait for Files\n'}>
<Result () {'asctime': datetime.datetime(2015, 5, 22, 16, 49, 32, 160000), 'module': '__main__', 'levelname': 'INFO', 'message': 'Starting: Attempt 1 Checking for New Files from gs://folder/folder/\n'}>
<Result () {'asctime': datetime.datetime(2015, 5, 22, 16, 49, 39, 329000), 'module': '__main__', 'levelname': 'INFO', 'message': 'Success: Downloading the Files from Cloud Storage: Return Code - 0 and FileCount 1\n'}>
<Result () {'asctime': datetime.datetime(2015, 5, 22, 16, 53, 30, 706000), 'module': '__main__', 'levelname': 'INFO', 'message': 'Starting to Wait for Files\n'}>


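If you specify the logging format in { style, chances are good that you can simply pass the logging format string to parse and it will just work. In this example I had to improvise a bit and use a custom timestamp parser to match the question's requirements. If the timestamps were in a common format such as ISO 8601, one could just use fmt="{asctime:ti} - {module} - {levelname} - {message}" and drop parse_logging_time and custom_parsers from the example code. parse supports several common timestamp formats out of the box; see the parse documentation for the list.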
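
For context, this is roughly what specifying a log format in { style looks like on the logging side. It is a minimal sketch using only the standard library: the format string mirrors the layout of the logs in the question, the choice of {name} for the second field is an assumption, and the record is built by hand purely for illustration.

```python
import logging

# "{"-style format string with the same layout as the logs in the question.
LOG_FORMAT = "{asctime} - {name} - {levelname} - {message}"

formatter = logging.Formatter(LOG_FORMAT, style="{")

# Build a record by hand just to show what the formatter produces.
record = logging.LogRecord(
    name="__main__",
    level=logging.INFO,
    pathname="example.py",  # placeholder, not used by this format
    lineno=1,
    msg="Starting to Wait for Files",
    args=None,
    exc_info=None,
)
formatted = formatter.format(record)
```

The resulting line has the shape "<timestamp> - __main__ - INFO - Starting to Wait for Files", with the timestamp in logging's default "YYYY-MM-DD HH:MM:SS,mmm" form.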

parse.Result is a dict-like object, so parsed_record["message"] returns the parsed message, and so on.

Note the printed ParsedException object: it is the exception parsed from the traceback.
