Question

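I am parsing log files obtained from a content delivery network. I have gotten to the point where I can isolate one part of the log file: the IP addresses that visited our website. What I want to do is find the top 10 IP addresses from a large list of every IP address. Some sample data of what I get when I print the list is shown below: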

192.168.1.1
192.168.1.1
192.168.1.1
192.168.1.1
192.168.1.1
192.168.1.2
192.168.1.2
192.168.1.2
192.168.1.2
192.168.1.1
192.168.1.1
192.168.1.1

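These are not the real IPs I get in the output, and there are many more. As you can see, they are not grouped together. How should I go about this?

Edit: here is my code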

import gzip
from collections import Counter
logFileName = open('C:\\Users\\Pawlaczykm\\Desktop\\fileNames.txt', 'r')
for line in logFileName.readlines():
    print 'Summary of: ' + line
    # use gzip to decompress the file
    with gzip.open('C:\\Users\\Pawlaczykm\\Desktop\\' + line.rstrip() + '.gz', 'rb') as f:
        for eachLine in f:
            parts = eachLine.split('\t')
            if len(parts) > 1:
                ipAdd = parts[2]
                c = Counter(ipAdd.splitlines())
                print(c.most_common(10))

Answer 1

You can use collections.Counter:

s = """192.168.1.1
192.168.1.1
192.168.1.1
192.168.1.1
192.168.1.1
192.168.1.2
192.168.1.2
192.168.1.2
192.168.1.2
192.168.1.1
192.168.1.1
192.168.1.1"""

from collections import Counter
c = Counter(s.splitlines())

Now you can get the 10 most common addresses, i.e. the top 10 list:

print(c.most_common(10))

Output:

[('192.168.1.1', 8), ('192.168.1.2', 4)]

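This is a list of (address, count) tuples, ordered from most to least frequent.

In your case, you need to feed all of the addresses to a single counter, rather than creating a new Counter for every line: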
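Each entry in the list is an (address, count) tuple, so it can be unpacked directly in a loop. A minimal sketch, assuming made-up sample addresses and a hypothetical helper named `format_top`:

```python
from collections import Counter

# Hypothetical sample data standing in for the parsed log column.
addresses = ["192.168.1.1"] * 8 + ["192.168.1.2"] * 4

def format_top(counter, n=10):
    """Return one 'address: count' string per entry, most frequent first."""
    return ["{}: {}".format(ip, count) for ip, count in counter.most_common(n)]

lines = format_top(Counter(addresses))
for line in lines:
    print(line)
```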

addresses = []
for eachLine in f:
    parts = eachLine.split('\t')
    # parts[2] needs at least three fields
    if len(parts) > 2:
        ipAdd = parts[2]
        addresses.append(ipAdd.strip())
# count once, after all lines have been collected
c = Counter(addresses)
print(c.most_common(10))
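Combined with the loop over several compressed logs from the question, one possible Python 3 sketch follows. The function name `top_addresses` and its parameters are illustrative, and the assumption that the address sits in the third tab-separated column comes from the question's code, not from any known log format.

```python
import gzip
from collections import Counter

def top_addresses(paths, column=2, n=10):
    """Count the values of one tab-separated column across gzip files."""
    counts = Counter()
    for path in paths:
        # 'rt' decompresses and decodes to text, so split('\t') works on str.
        with gzip.open(path, "rt") as f:
            for line in f:
                parts = line.rstrip("\n").split("\t")
                # Skip lines that are too short to contain the column.
                if len(parts) > column:
                    counts[parts[column].strip()] += 1
    return counts.most_common(n)
```

Passing several paths gives one combined ranking. To get a separate summary per file, as in the question, call the function once per file with a single-element list.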
