Request log parser - text parsing

Time: 2017-12-12 10:02:25

Tags: java awk groovy text-parsing processing-efficiency

I have to parse a request log with the following structure:

07/Dec/2017:18:15:58 +0100 [293920] -> GET URL HTTP/1.1
07/Dec/2017:18:15:58 +0100 [293920] <- 200 text/html 5ms
07/Dec/2017:18:15:58 +0100 [293921] -> GET URL HTTP/1.1
07/Dec/2017:18:15:58 +0100 [293921] <- 200 image/png 39ms
07/Dec/2017:18:15:59 +0100 [293922] -> HEAD URL HTTP/1.0
07/Dec/2017:18:15:59 +0100 [293922] <- 401 - 1ms
07/Dec/2017:18:15:59 +0100 [293923] -> GET URL HTTP/1.1
07/Dec/2017:18:15:59 +0100 [293923] <- 200 text/html 178ms
07/Dec/2017:18:15:59 +0100 [293924] -> GET URL HTTP/1.1
07/Dec/2017:18:15:59 +0100 [293924] <- 200 text/html 11ms
07/Dec/2017:18:15:59 +0100 [293925] -> GET URL HTTP/1.1
07/Dec/2017:18:15:59 +0100 [293925] <- 200 text/html 7ms
07/Dec/2017:18:15:59 +0100 [293926] -> GET URL HTTP/1.1
07/Dec/2017:18:15:59 +0100 [293926] <- 200 text/html 16ms
07/Dec/2017:18:15:59 +0100 [293927] -> GET URL HTTP/1.1
07/Dec/2017:18:15:59 +0100 [293927] <- 200 text/html 8ms

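The output should join the two lines of this log that share the same number between the square brackets. The goal is to extract information from this log file for use in other data-processing packages, and I would like to do that through a CSV file. The CSV file should be structured as follows: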

startTimestamp,endTimestamp,requestType,responseCode,URL,typ,responsetime

07/Dec/2017:18:15:58,07/Dec/2017:18:15:58,GET,200,URL,text/html,5ms

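I wrote a Groovy script that does the trick, but it is very slow.

I know there are improvements I could make, but I would like to hear your ideas. Some of you may have solved this problem before.

A response does not always directly follow its request, and not every request gets a response (or the response was not logged because of a server restart).

The log files range from 70 MB to 300 MB, and my Groovy script takes a very long time.

I know there are good, fast solutions in the Unix terminal using awk and sort, but I have no experience with them.

Thanks in advance for your help.

Here is the code I have so far. Possible improvements:

1) Use a map keyed by the number, for faster lookups and less parsing

2) Don't scan the whole backlog list for every line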

def logFile = new File("../request.log")
def outputfile = new File(logFile.parent, logFile.name + ".csv")
def backlog = new ArrayList<String>()
StringBuilder output = new StringBuilder()


outputfile.withPrintWriter { writer ->
    logFile.withReader { Reader reader ->
        reader.eachLine { String line ->
            Iterator<String> it = backlog.iterator()
            while (it.hasNext()) {
                String bLine = it.next()
                String[] lineSplit = line.split(" ")
                if (bLine.contains(lineSplit[2])) {
                    String[] bLineSplit = bLine.split(" ")
                    output.append(bLineSplit[0] + "," + lineSplit[0] + "," + bLineSplit[4] + "," + lineSplit[4] + "," + bLineSplit[5] + "," + lineSplit[5] + "," + lineSplit[6] + "\r\n")
                    //writer.println(outputline)
                    it.remove()
                }
            }
            backlog.add(line)
        }
    }
    writer.println(output)
    // Lines still in the backlog never found a partner; append them unchanged.
    backlog.each { String line ->
        writer.println(line)
    }
}
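
The two improvements listed above (a map keyed by the bracketed number, and no backlog scan per line) can be sketched roughly as follows. This is only an illustration in plain Java, not a tested replacement for the script: the class name `RequestLogJoiner`, the method `join`, and the `unmatched` list are hypothetical names, and the sketch assumes that fields are separated by single spaces and that the URL contains no spaces.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper: pairs "->" and "<-" lines by their bracketed id in one pass.
final class RequestLogJoiner {

    // Returns one CSV row per matched request/response pair.
    // Lines that cannot be paired are appended to `unmatched`.
    static List<String> join(Iterable<String> lines, List<String> unmatched) {
        // id (e.g. "[293920]") -> already split fields of the open request
        Map<String, String[]> pending = new LinkedHashMap<>();
        List<String> rows = new ArrayList<>();

        for (String line : lines) {
            String[] f = line.split(" ");      // each line is split exactly once
            boolean complete = f.length >= 7;  // timestamp, zone, id, arrow, 3 payload fields

            if (complete && f[3].equals("->")) {
                // Remember the request; a reused id pushes the older request out.
                String[] previous = pending.put(f[2], f);
                if (previous != null) {
                    unmatched.add(String.join(" ", previous));
                }
            } else if (complete && f[3].equals("<-") && pending.containsKey(f[2])) {
                // Constant-time lookup instead of scanning a backlog list.
                String[] req = pending.remove(f[2]);
                rows.add(String.join(",", req[0], f[0], req[4], f[4], req[5], f[5], f[6]));
            } else {
                // Malformed line, or a response whose request was never seen.
                unmatched.add(line);
            }
        }

        // Requests that never received a response.
        for (String[] req : pending.values()) {
            unmatched.add(String.join(" ", req));
        }
        return rows;
    }

    public static void main(String[] args) {
        List<String> sample = Arrays.asList(
            "07/Dec/2017:18:15:58 +0100 [293920] -> GET URL HTTP/1.1",
            "07/Dec/2017:18:15:58 +0100 [293921] -> GET URL HTTP/1.1",
            "07/Dec/2017:18:15:58 +0100 [293920] <- 200 text/html 5ms");
        List<String> unmatched = new ArrayList<>();
        // Prints: 07/Dec/2017:18:15:58,07/Dec/2017:18:15:58,GET,200,URL,text/html,5ms
        join(sample, unmatched).forEach(System.out::println);
        System.out.println("unmatched: " + unmatched.size());
    }
}
```

Because each line is split once and looked up by key, the work per line no longer grows with the number of open requests. For a 300 MB file it would probably be better to write each row to the output as soon as it is produced rather than collecting all rows in a list; the list is used here only to keep the sketch short.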

1 answer:

Answer 0: (score: 0)

As a one-liner:

sort -k 3,3 request.log | awk 'BEGIN { print "startTimestamp;endTimestamp;requestType;responseCode;URL;typ;responsetime"; split("", request); split("", response) } $4 == "->" { printLine(); split($0, request); split("", response) } $4 == "<-" { split($0, response) } END { printLine() } function printLine() { if (length(request)) { print request[1] ";" response[1] ";" request[5] ";" response[5] ";" request[6] ";" response[6] ";" response[7] } }'

As a multi-liner:

sort -k 3,3 request.log | awk '
    BEGIN {
        print "startTimestamp;endTimestamp;requestType;responseCode;URL;typ;responsetime"
        split("", request)
    }
    $4 == "->" {
        printLine()
        split($0, request)
        split("", response)
    }
    $4 == "<-" {
        split($0, response)
    }
    END {
        printLine()
    }
    function printLine() {
        if (length(request)) {
            print request[1] ";" response[1] ";" request[5] ";" response[5] ";" request[6] ";" response[6] ";" response[7]
        }
    }'