Extracting a list of unique visitors from log files

Time: 2014-01-07 16:39:59

Tags: python regex parsing search logging

From a set of log files (named access.log.*) that look like this, I would like to extract

95.11.113.x - [15/Nov/2013:18:25:17 +0100] "GET /files/myfile.rar HTTP/1.1" 200 2437305154 blah.com "http://www.blah.com/files/" "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0)" "-"
95.11.113.x - [15/Nov/2013:18:25:19 +0100] "GET /files/myfile.rar HTTP/1.1" 200 2437305154 blah.com "http://www.blah.com/files/" "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0)" "-"
95.11.113.x - [15/Nov/2013:18:25:21 +0100] "GET /files/myfile.rar HTTP/1.1" 200 2437305154 blah.com "http://www.blah.com/files/" "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0)" "-"
125.111.9.x - [15/Nov/2013:20:00:00 +0100] "GET /files/azeazzae.rar HTTP/1.1" 200 2437305154 blah.com "http://www.blah.com/files/" "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0)" "-"
132.41.100.x - [16/Nov/2013:11:15:11 +0100] "GET /files/myfile.rar HTTP/1.1" 200 2437305154 blah.com "http://www.blah.com/files/" "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0)" "-"
132.41.100.x - [16/Nov/2013:11:15:11 +0100] "GET /files/myfile.rar HTTP/1.1" 200 2437305154 blah.com "http://www.blah.com/files/" "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0)" "-"
132.41.100.x - [16/Nov/2013:11:15:11 +0100] "GET /files/myfile.rar HTTP/1.1" 200 2437305154 blah.com "http://www.blah.com/files/" "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0)" "-"

the list of unique visitors of /files/myfile.rar (counting repeat visits from the same IP only once per day), i.e.:

95.11.113.x - [15/Nov/2013] "GET /files/myfile.rar HTTP/1.1" 200 2437305154 blah.com "http://www.blah.com/files/" "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0)" "-"
132.41.100.x - [16/Nov/2013] "GET /files/myfile.rar HTTP/1.1" 200 2437305154 blah.com "http://www.blah.com/files/" "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0)" "-"

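I tried opening the file and looking for the required string /files/myfile.rar, as shown in Search for string in txt file Python, but I could not work out how to test for "same IP address" and duplicates.

How should I go about this? A standard string search, line by line (as in Search for string in txt file Python)? A regular expression?

PS: Output in the following form would be even better for later use (sorting by date, etc.):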

2013-11-15 - 95.11.113.x - "GET /files/myfile.rar HTTP/1.1"
2013-11-16 - 132.41.100.x - "GET /files/myfile.rar HTTP/1.1"
2013-11-17 ....

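3 Answers:

Answer 0: (score: 1)

This should be the algorithm for your Python code:

1) Read each line from the file.
2) If the line contains the text /files/myfile.rar, then:
  3) Parse the IP address from the line. You can use a regular expression, or split the line and take the part before the first space.
  4) Save the line in a Python dict() variable like this: visitors[ip] = line

When you are done, print the contents of visitors.

Here is sample code for 3) and 4).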

visitors = dict()
# repeat the following for each line of the log
line = '95.11.113.x - [15/Nov/2013]'
ip = line.split(" - ")[0]  # assumes the line contains " - "
visitors[ip] = line

# finally, once all lines have been processed
for visitor in visitors:
    print(visitors[visitor])
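
Step 3) also mentions a regular expression as an alternative to splitting. A minimal sketch of that option follows, assuming every line has the layout of the samples in the question; the pattern `LINE_RE` and the helper `parse_line` are hypothetical names introduced for illustration, not part of the original answer.

```python
import re

# Hypothetical pattern for the sample lines in the question:
# IP, " - ", [day/month/year:time zone], then the quoted request line.
LINE_RE = re.compile(
    r'^(?P<ip>\S+) - \[(?P<date>[^:\]]+):[^\]]*\] "(?P<method>\S+) (?P<url>\S+) [^"]*"'
)

def parse_line(line):
    """Return (ip, date, url) for a matching log line, or None otherwise."""
    match = LINE_RE.match(line)
    if match is None:
        return None
    return match.group('ip'), match.group('date'), match.group('url')

# Shortened copy of the first sample line from the question.
sample = ('95.11.113.x - [15/Nov/2013:18:25:17 +0100] '
          '"GET /files/myfile.rar HTTP/1.1" 200 2437305154 blah.com')
print(parse_line(sample))
```

Returning None for lines that do not match lets the caller skip malformed entries instead of failing on them.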

Answer 1: (score: 1)

Here is a way to get the answer grouped by date, i.e. the unique visitors per day who requested myfile.rar, across all files named access.log.*:

import glob
from collections import defaultdict

# maps each date (e.g. '15/Nov/2013') to the set of IPs seen on that day
d = defaultdict(set)

for path in glob.glob('access.log.*'):
    with open(path) as log:
        for line in log:
            if line.strip():  # skips empty lines
                bits = line.split(' - ', 1)  # split only on the separator after the IP
                ip = bits[0].strip()
                fields = bits[1].split()
                date = fields[0][1:][:-9]  # '[15/Nov/2013:18:25:17' -> '15/Nov/2013'
                url = fields[3]
                if url == '/files/myfile.rar':
                    d[date].add(ip)

for date, values in d.items():
    print('Total unique visits for {}: {}'.format(date, len(values)))
    for ip in values:
        print(ip)
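
The question's PS asks for output such as `2013-11-15 - 95.11.113.x - "GET /files/myfile.rar HTTP/1.1"`, sorted by date. A possible sketch is shown here, assuming a mapping of the same shape as `d` (date string to set of IPs). The function `format_report` and its `request` parameter are hypothetical, and `%b` assumes English month abbreviations, which is the case in the default C locale.

```python
from collections import defaultdict
from datetime import datetime

def format_report(visitors_by_day, request='"GET /files/myfile.rar HTTP/1.1"'):
    """Return report lines sorted chronologically, one per (day, IP)."""
    report = []
    # Parse '15/Nov/2013' into a real date so that sorting is chronological
    # rather than alphabetical.
    days = sorted(visitors_by_day,
                  key=lambda day: datetime.strptime(day, '%d/%b/%Y'))
    for day in days:
        iso_day = datetime.strptime(day, '%d/%b/%Y').strftime('%Y-%m-%d')
        for ip in sorted(visitors_by_day[day]):
            report.append('{} - {} - {}'.format(iso_day, ip, request))
    return report

# Example data in the same shape as the defaultdict(set) used in this answer.
example = defaultdict(set)
example['16/Nov/2013'].add('132.41.100.x')
example['15/Nov/2013'].add('95.11.113.x')
for row in format_report(example):
    print(row)
```

Converting to ISO dates (year first) means the resulting text lines also sort correctly with a plain string sort later on.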

Answer 2: (score: 0)

The following is the result of applying the approach from SabujHassan's answer. I am posting it only for future reference.

visitors = dict()

with open('access.log.52') as fp:
    for line in fp:
        if '/files/myfile.rar' in line:
            ip = line.split(" - ")[0]  # assuming it must have " - " in line
            visitors[ip] = line

for ip in visitors:
    print(visitors[ip])
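
Note that keying the dict on the IP alone keeps one line per IP over the whole file, not one per IP per day as the question asks. A hedged variant that keys on the (IP, day) pair might look like this; `first_visit_per_day` is a hypothetical helper, the sample lines are shortened, and the last one is invented purely to show a repeat visit on a second day.

```python
def first_visit_per_day(lines, target='/files/myfile.rar'):
    """Keep the first matching line for each (IP, day) pair, in input order."""
    seen = set()
    kept = []
    for line in lines:
        if target not in line:
            continue
        ip, sep, rest = line.partition(' - ')
        if not sep:
            continue  # no " - " separator, so not in the assumed layout
        # rest starts with '[15/Nov/2013:18:25:17 ...'; keep the part
        # between '[' and the first ':' as the day.
        day = rest[1:].split(':', 1)[0]
        if (ip, day) not in seen:
            seen.add((ip, day))
            kept.append(line.rstrip('\n'))
    return kept

log = [
    '95.11.113.x - [15/Nov/2013:18:25:17 +0100] "GET /files/myfile.rar HTTP/1.1" 200',
    '95.11.113.x - [15/Nov/2013:18:25:19 +0100] "GET /files/myfile.rar HTTP/1.1" 200',
    '125.111.9.x - [15/Nov/2013:20:00:00 +0100] "GET /files/azeazzae.rar HTTP/1.1" 200',
    '95.11.113.x - [16/Nov/2013:09:00:00 +0100] "GET /files/myfile.rar HTTP/1.1" 200',
]
for line in first_visit_per_day(log):
    print(line)
```

The same function can be fed an open file object instead of a list, since it only iterates over lines.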