Question

From a list of log files (named access.log.*) that look like this, I want to extract

95.11.113.x - [15/Nov/2013:18:25:17 +0100] "GET /files/myfile.rar HTTP/1.1" 200 2437305154 blah.com "http://www.blah.com/files/" "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0)" "-"
95.11.113.x - [15/Nov/2013:18:25:19 +0100] "GET /files/myfile.rar HTTP/1.1" 200 2437305154 blah.com "http://www.blah.com/files/" "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0)" "-"
95.11.113.x - [15/Nov/2013:18:25:21 +0100] "GET /files/myfile.rar HTTP/1.1" 200 2437305154 blah.com "http://www.blah.com/files/" "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0)" "-"
125.111.9.x - [15/Nov/2013:20:00:00 +0100] "GET /files/azeazzae.rar HTTP/1.1" 200 2437305154 blah.com "http://www.blah.com/files/" "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0)" "-"
132.41.100.x - [16/Nov/2013:11:15:11 +0100] "GET /files/myfile.rar HTTP/1.1" 200 2437305154 blah.com "http://www.blah.com/files/" "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0)" "-"
132.41.100.x - [16/Nov/2013:11:15:11 +0100] "GET /files/myfile.rar HTTP/1.1" 200 2437305154 blah.com "http://www.blah.com/files/" "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0)" "-"
132.41.100.x - [16/Nov/2013:11:15:11 +0100] "GET /files/myfile.rar HTTP/1.1" 200 2437305154 blah.com "http://www.blah.com/files/" "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0)" "-"

the list of unique visitors who accessed /files/myfile.rar (each visitor listed at most once per day), i.e.:

95.11.113.x - [15/Nov/2013] "GET /files/myfile.rar HTTP/1.1" 200 2437305154 blah.com "http://www.blah.com/files/" "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0)" "-"
132.41.100.x - [16/Nov/2013] "GET /files/myfile.rar HTTP/1.1" 200 2437305154 blah.com "http://www.blah.com/files/" "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0)" "-"

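I tried opening the file and searching for the required string /files/myfile.rar, as described in "Search for string in txt file Python", but I can't work out how to test for "same IP address" and for duplicates.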

How should I go about this? A standard string search, line by line (as in "Search for string in txt file Python")? A regular expression?

PS: This format would be even better for later use (sorting by date, etc.):

2013-11-15 - 95.11.113.x - "GET /files/myfile.rar HTTP/1.1"
2013-11-16 - 132.41.100.x - "GET /files/myfile.rar HTTP/1.1"
2013-11-17 ....

Answer 1

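This should be the algorithm for your Python code:

1) Read each line from the file.
2) If the line contains the text /files/myfile.rar, continue with steps 3 and 4.
3) Parse the IP address from the line. You can use a regular expression, or split the line and take the part before the first space.
4) Save the line in a Python dict() variable, like this: visitors[ip] = line

When you are done, print the contents of visitors.

Here is sample code for steps 3) and 4).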

visitors = dict()
# repeat the next three statements for every matching line read from the file
line = '95.11.113.x - [15/Nov/2013]'
ip = line.split(" - ")[0]  # assumes the line contains " - " right after the IP
visitors[ip] = line

# finally, once every line has been processed
for visitor in visitors:
    print(visitors[visitor])
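
For step 3, the regular-expression route might look like the sketch below. The pattern and the helper name parse_line are illustrative assumptions based only on the log layout shown in the question; a different server log format would need a different pattern.

```python
import re

# Assumed layout: <ip> - [<day>/<month>/<year>:<time> <zone>] "<request>" ...
LOG_LINE = re.compile(
    r'^(?P<ip>\S+) - \[(?P<date>[^:\]]+):[^\]]*\] "(?P<request>[^"]*)"'
)

def parse_line(line):
    """Return (ip, date, request) for a matching line, else None."""
    match = LOG_LINE.match(line)
    if match is None:
        return None
    return match.group("ip"), match.group("date"), match.group("request")

# Sample line shortened from the question's log excerpt.
sample = ('95.11.113.x - [15/Nov/2013:18:25:17 +0100] '
          '"GET /files/myfile.rar HTTP/1.1" 200 2437305154 blah.com')
print(parse_line(sample))
```

Returning None for lines that do not match lets the caller skip blank or malformed lines instead of raising an exception.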

Answer 2

Here is a way to get your answer sorted by date, that is, the unique visitors per day who requested myfile.rar, across all your files named access.log.*:

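import glob
from collections import defaultdict
from datetime import datetime

d = defaultdict(set)  # maps a date such as '15/Nov/2013' to a set of IPs

for path in glob.glob('access.log.*'):
    with open(path) as log:
        for line in log:
            if line.strip():  # skips empty lines
                # split only on the first " - " so that a negative UTC offset
                # such as "-0500" does not cut the line short
                bits = line.split(' - ', 1)
                ip = bits[0].strip()
                date = bits[1].split()[0][1:][:-9]  # drop '[' and ':HH:MM:SS'
                url = bits[1].split()[3]
                if url == '/files/myfile.rar':
                    d[date].add(ip)

# iterate in chronological order rather than in dictionary order
for date, values in sorted(d.items(), key=lambda kv: datetime.strptime(kv[0], '%d/%b/%Y')):
    print('Total unique visits for {}: {}'.format(date, len(values)))
    for ip in values:
        print(ip)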
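
To get closer to the "date - IP - request" layout asked for in the question's PS, the per-day sets could be flattened into sorted rows. This is only a sketch: to_iso and format_report are hypothetical helper names, the request text is hard-coded because the d mapping above stores only IPs, and %b assumes English month abbreviations (the default C locale).

```python
from datetime import datetime

def to_iso(log_date):
    """Convert '15/Nov/2013' to '2013-11-15' (English month names assumed)."""
    return datetime.strptime(log_date, "%d/%b/%Y").strftime("%Y-%m-%d")

def format_report(visits_by_date, request='"GET /files/myfile.rar HTTP/1.1"'):
    """Build 'date - ip - request' rows, ordered by date and then by IP."""
    rows = []
    for log_date, ips in visits_by_date.items():
        for ip in ips:
            rows.append((to_iso(log_date), ip))
    return ["{} - {} - {}".format(day, ip, request) for day, ip in sorted(rows)]

# Hypothetical data shaped like the d mapping built above.
example = {"16/Nov/2013": {"132.41.100.x"}, "15/Nov/2013": {"95.11.113.x"}}
for row in format_report(example):
    print(row)
```

Because ISO dates sort correctly as plain strings, sorting the (date, ip) tuples is enough to order the report chronologically.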

Answer 3

The following answer is the result of applying SabujHassan's approach. I am posting it only for future reference.

visitors = dict()

with open('access.log.52') as fp:
    for line in fp:
        if '/files/myfile.rar' in line:
            ip = line.split(" - ")[0]  # assuming it must have " - " in line
            visitors[ip] = line

for ip in visitors:
    print(visitors[ip])
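
Note that keying the dictionary on the IP alone keeps one line per visitor across all days, and the last matching line wins. If one entry per visitor per day is wanted, as in the question, the key could be the (date, ip) pair instead. The following is a minimal sketch under that assumption; unique_daily_visitors is a hypothetical name, and the sample lines are shortened from the question's excerpt.

```python
def unique_daily_visitors(lines, path="/files/myfile.rar"):
    """Keep the first matching line per (date, ip) pair, in input order."""
    seen = {}
    for line in lines:
        if path not in line:
            continue
        ip, _, rest = line.partition(" - ")
        # The date is the text between "[" and the first ":".
        date = rest[rest.find("[") + 1:].split(":", 1)[0]
        seen.setdefault((date, ip), line.rstrip("\n"))
    return seen

sample = [
    '95.11.113.x - [15/Nov/2013:18:25:17 +0100] "GET /files/myfile.rar HTTP/1.1" 200',
    '95.11.113.x - [15/Nov/2013:18:25:19 +0100] "GET /files/myfile.rar HTTP/1.1" 200',
    '125.111.9.x - [15/Nov/2013:20:00:00 +0100] "GET /files/azeazzae.rar HTTP/1.1" 200',
    '132.41.100.x - [16/Nov/2013:11:15:11 +0100] "GET /files/myfile.rar HTTP/1.1" 200',
]
for (date, ip), line in unique_daily_visitors(sample).items():
    print(date, ip)
```

setdefault stores a line only the first time a (date, ip) pair is seen, so later requests from the same address on the same day are ignored.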

Extract a list of unique visitors from log files
