Add multiple columns to a data.table by matching groups with a regular expression

Asked: 2015-10-01 03:40:57

Tags: r data.table

R newbie here. This is probably obvious, but my searches just haven't turned up the right thing.

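I'm parsing web server logs into a data.table, and I want to create a bunch of columns by extracting parts of the request string. My source data looks like this: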

2015-09-01T07:18:17+09:30 bozobox nginx_access: 10.0.0.1 - - [01/Sep/2015:07:18:15 +0930]  "GET /silly/sales/1234567890?amazeballsTask=Y HTTP/1.1" 200 26294 "https://bela.com/home/amazeballs" "Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.1; Trident/6.0)" "-" 2.031 2.031 .
2015-09-01T07:18:17+09:30 bozobox nginx_access: 10.0.0.1 - - [01/Sep/2015:07:18:15 +0930]  "GET /silly/jawr/css/gzip_N676825985/bundles/app.css HTTP/1.1" 200 4485 "https://bela.com/silly/sales/1234567890?amazeballsTask=Y" "Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.1; Trident/6.0)" "-" 0.173 0.173 .
2015-09-01T07:18:17+09:30 bozobox nginx_access: 10.0.0.1 - - [01/Sep/2015:07:18:15 +0930]  "GET /silly/jawr/css/gzip_2073017426/bundles/lib.css HTTP/1.1" 200 4851 "https://bela.com/silly/sales/1234567890?amazeballsTask=Y" "Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.1; Trident/6.0)" "-" 0.168 0.168 .
2015-09-01T07:18:17+09:30 bozobox nginx_access: 10.0.0.1 - - [01/Sep/2015:07:18:15 +0930]  "GET /silly/jawr/js/gzip_1764696599/bundles/app.js HTTP/1.1" 200 7499 "https://bela.com/silly/sales/1234567890?amazeballsTask=Y" "Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.1; Trident/6.0)" "-" 0.290 0.290 .
2015-09-01T07:18:17+09:30 bozobox nginx_access: 10.0.0.1 - - [01/Sep/2015:07:18:15 +0930]  "GET /silly/jawr/js/gzip_N1319387470/bundles/lib.js HTTP/1.1" 200 132880 "https://bela.com/silly/sales/1234567890?amazeballsTask=Y" "Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.1; Trident/6.0)" "-" 0.366 0.366 .
2015-09-01T07:18:17+09:30 bozobox nginx_access: 10.0.0.1 - - [01/Sep/2015:07:18:16 +0930]  "GET /silly/js/ajaxResponseHandler.js;jsessionid=4EFF0C6ECC2565927321AE8ED72E8558 HTTP/1.1" 200 1386 "https://bela.com/silly/sales/1234567890?amazeballsTask=Y" "Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.1; Trident/6.0)" "-" 0.233 0.233 .
2015-09-01T07:18:17+09:30 bozobox nginx_access: 10.0.0.1 - - [01/Sep/2015:07:18:16 +0930]  "GET /silly/styles/tabs.css;jsessionid=4EFF0C6ECC2565927321AE8ED72E8558 HTTP/1.1" 200 2121 "https://bela.com/silly/sales/1234567890?amazeballsTask=Y" "Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.1; Trident/6.0)" "-" 0.108 0.108 .
2015-09-01T07:18:17+09:30 bozobox nginx_access: 10.0.0.1 - - [01/Sep/2015:07:18:16 +0930]  "GET /silly/js/tabs.js;jsessionid=4EFF0C6ECC2565927321AE8ED72E8558 HTTP/1.1" 200 3230 "https://bela.com/silly/sales/1234567890?amazeballsTask=Y" "Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.1; Trident/6.0)" "-" 0.174 0.174 .

So I knocked up the following code:

library(data.table)

# Drop junk requests and keep fields 4 and 6 onward before reading the log.
alog <- fread('cat sample.log | grep -v "GET /junk" | cut -f 4,6- -d " " ')
setnames(alog, c("ip","remote_user","datetime","timezone","request","status","bytes","referer","user_agent","http_x_forwarded_for","request_time","upstream_response_time","pipe"))

# Split each request string into its parts; returns a character matrix.
request_parts <- function(x) {
  m <- regexec("^([A-Z]+) /([^/]+)/([^\\?]+)(\\?[^ ]+)? HTTP/(.*)", x)
  parts <- do.call(rbind, lapply(regmatches(x, m), `[`, c(2, 3, 4, 5, 6)))
  colnames(parts) <- c("method","webapp","page","query_string", "http_version")
  parts
}

parts <- request_parts(alog$request)
# The assignment itself, as it appears in the warning text:
alog[, colnames(parts) := parts]

It seems to work up to a point:

> alog$request
[1] "GET /silly/sales/1234567890?amazeballsTask=Y HTTP/1.1"                                     "GET /silly/jawr/css/gzip_N676825985/bundles/app.css HTTP/1.1"                             
[3] "GET /silly/jawr/css/gzip_2073017426/bundles/lib.css HTTP/1.1"                              "GET /silly/jawr/js/gzip_1764696599/bundles/app.js HTTP/1.1"                               
[5] "GET /silly/jawr/js/gzip_N1319387470/bundles/lib.js HTTP/1.1"                               "GET /silly/js/ajaxResponseHandler.js;jsessionid=4EFF0C6ECC2565927321AE8ED72E8558 HTTP/1.1"
[7] "GET /silly/styles/tabs.css;jsessionid=4EFF0C6ECC2565927321AE8ED72E8558 HTTP/1.1"           "GET /silly/js/tabs.js;jsessionid=4EFF0C6ECC2565927321AE8ED72E8558 HTTP/1.1"

> parts
     method webapp  page                                                                    query_string        http_version
[1,] "GET"  "silly" "sales/1234567890"                                                      "?amazeballsTask=Y" "1.1"       
[2,] "GET"  "silly" "jawr/css/gzip_N676825985/bundles/app.css"                              ""                  "1.1"       
[3,] "GET"  "silly" "jawr/css/gzip_2073017426/bundles/lib.css"                              ""                  "1.1"       
[4,] "GET"  "silly" "jawr/js/gzip_1764696599/bundles/app.js"                                ""                  "1.1"       
[5,] "GET"  "silly" "jawr/js/gzip_N1319387470/bundles/lib.js"                               ""                  "1.1"       
[6,] "GET"  "silly" "js/ajaxResponseHandler.js;jsessionid=4EFF0C6ECC2565927321AE8ED72E8558" ""                  "1.1"       
[7,] "GET"  "silly" "styles/tabs.css;jsessionid=4EFF0C6ECC2565927321AE8ED72E8558"           ""                  "1.1"       
[8,] "GET"  "silly" "js/tabs.js;jsessionid=4EFF0C6ECC2565927321AE8ED72E8558"                ""                  "1.1"        

But it doesn't do what I want, which is to add every column of parts to alog:

> alog$method
[1] "GET" "GET" "GET" "GET" "GET" "GET" "GET" "GET"
> # yay!
> alog$webapp
[1] "GET" "GET" "GET" "GET" "GET" "GET" "GET" "GET"
> # dismay :(

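What am I doing wrong? There are a bunch of warnings like the following, but I don't really understand what they are trying to tell me.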

1: In `[.data.table`(alog, , `:=`(colnames(parts), parts)) :
  5 column matrix RHS of := will be treated as one vector
2: In `[.data.table`(alog, , `:=`(colnames(parts), parts)) :
  Supplied 40 items to be assigned to 8 items of column 'method' (32 unused)

1 answer:

Answer 0 (score: 4)

parts is a matrix; you have to convert it to a data.table for the assignment to work. Here is an example:

m <- matrix(1:25, ncol=5)
colnames(m) <- LETTERS[1:5]
library(data.table)
dt <- data.table(x=1:5)

dt[,colnames(m):=m]
# Warning messages:
# 1: In `[.data.table`(dt, , `:=`(colnames(m), m)) :
#   5 column matrix RHS of := will be treated as one vector
# ...
dt          # not what you want...
#    x A B C D E
# 1: 1 1 1 1 1 1
# 2: 2 2 2 2 2 2
# 3: 3 3 3 3 3 3
# 4: 4 4 4 4 4 4
# 5: 5 5 5 5 5 5

dt[,colnames(m):=as.data.table(m)]
dt          # better
#    x A  B  C  D  E
# 1: 1 1  6 11 16 21
# 2: 2 2  7 12 17 22
# 3: 3 3  8 13 18 23
# 4: 4 4  9 14 19 24
# 5: 5 5 10 15 20 25
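
Applied to the question's data, the same fix might look like the sketch below. The three-row alog here is a hypothetical stand-in that holds only the request column (the real table comes from fread), and request_parts is copied from the question. A matrix in R is a single atomic vector with a dim attribute, which is why := reports 40 items for 8 rows in the warning; as.data.table() turns it into a list of column vectors instead.

```r
library(data.table)

# Hypothetical stand-in for alog: only the request column, with three
# request strings taken from the sample log in the question.
alog <- data.table(request = c(
  "GET /silly/sales/1234567890?amazeballsTask=Y HTTP/1.1",
  "GET /silly/jawr/css/gzip_N676825985/bundles/app.css HTTP/1.1",
  "GET /silly/js/tabs.js;jsessionid=4EFF0C6ECC2565927321AE8ED72E8558 HTTP/1.1"
))

# Parser from the question: returns a character matrix, one row per request.
request_parts <- function(x) {
  m <- regexec("^([A-Z]+) /([^/]+)/([^\\?]+)(\\?[^ ]+)? HTTP/(.*)", x)
  parts <- do.call(rbind, lapply(regmatches(x, m), `[`, 2:6))
  colnames(parts) <- c("method", "webapp", "page", "query_string", "http_version")
  parts
}

parts <- request_parts(alog$request)

# The matrix is one vector of nrow * ncol values, not five columns.
length(parts)   # 15 (3 rows * 5 columns)

# Converting first gives := one vector per new column.
alog[, colnames(parts) := as.data.table(parts)]

alog$method     # "GET" "GET" "GET"
alog$webapp     # "silly" "silly" "silly"
```

An alternative with the same effect would be to have request_parts return as.data.table(parts) itself, so callers never handle the matrix.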