Group multiple lists based on multiple factors

Time: 2016-12-07 20:17:56

Tags: python list

Python newbie here...

If I have an output file (a firewall log file) that looks like this:

(source)    (dest)    (proto) (service)

10.10.10.1    20.20.20.1    TCP 80
10.10.10.1    30.30.30.1    TCP 80
10.10.10.1    40.40.40.1    TCP 514
10.10.10.1    40.40.40.1    TCP 443

I need to group this data wherever 3 of the 4 fields match. So, based on the output above, I need to write it to a new file that looks like:

10.10.10.1    20.20.20.1;30.30.30.1    TCP 80

                 OR
10.10.10.1    40.40.40.1    TCP 514, 443

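(Note the semicolon separating the IP addresses, and the comma separating the service objects on the second line.)

I have looked at Python's groupby function, but I can't quite get it to work.

So, in English (in my head):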

for every line in the file,
    if source and/or dest and/or proto and/or service match any other
    line in the file
        combine on one line and write to file (with a semicolon if separating
        IP addresses and a comma if separating service objects)

import re
from itertools import groupby
from sys import argv
#Written by Clyde Colbert - f7cmb14

script, filename = argv

data = []

def connection_list(filename):
    try:
        with open(filename, "r") as file:
            text = file.read()
    except IOError:
        print(filename, "does not exist in the current directory. Are you in the correct directory?")
        return

    # Extract each field from the raw firewall log.
    sources = re.findall(r'src=(?:[\d]{1,3})\.(?:[\d]{1,3})\.(?:[\d]{1,3})\.(?:[\d]{1,3})', text)
    dest = re.findall(r'dst=(?:[\d]{1,3})\.(?:[\d]{1,3})\.(?:[\d]{1,3})\.(?:[\d]{1,3})', text)
    service = re.findall(r'service=(\d+)', text)
    proto = re.findall(r'proto=(\w+)', text)

    proto = [item.upper() for item in proto]
    sources = [item.split('=')[1] for item in sources]
    dest = [item.split('=')[1] for item in dest]

    # Keep each (source, dest, proto, service) row once, in first-seen order.
    for item in zip(sources, dest, proto, service):
        if item not in data:
            data.append(item)

    with open(filename + "OUTPUT.txt", "w") as TufinReq:
        for item in data:
            TufinReq.write('{}\t{}\t{} {}\n'.format(*item))

def getcolumns(cols):
    # Build a key function that returns the chosen columns of a row.
    def f(row):
        return tuple(row[i] for i in cols)
    return f

# Populate data before grouping.
connection_list(filename)

cols = (0, 2, 3)
# groupby only groups adjacent rows, so sort by the same key first.
data.sort(key=getcolumns(cols))
for k, v in groupby(data, getcolumns(cols)):
    print(k, list(v))

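1 answer:

Answer 0: (score: 0)

groupby(iterable, keyfunc) works by grouping consecutive items that share the same key (the value returned by keyfunc). Because only adjacent items are grouped, the data normally has to be sorted by the same key first.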
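
The adjacency requirement is easy to miss, so here is a minimal sketch of it. The `rows` list and the `first_field` helper are made-up illustrations, not part of the question's data:

```python
from itertools import groupby

# Hypothetical sample: (protocol, port) pairs, deliberately unsorted.
rows = [("TCP", "80"), ("UDP", "53"), ("TCP", "443")]

def first_field(row):
    return row[0]

# Without sorting, "TCP" shows up as two separate groups.
unsorted_keys = [key for key, _ in groupby(rows, first_field)]

# Sorting by the same key first makes equal keys adjacent.
sorted_keys = [key for key, _ in groupby(sorted(rows, key=first_field), first_field)]

print(unsorted_keys)  # ['TCP', 'UDP', 'TCP']
print(sorted_keys)    # ['TCP', 'UDP']
```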

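To accomplish your task, you can have keyfunc return several items from a row.

For simplicity, let's assume you already have the data in the following format: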

data=[
('10.10.10.1', '20.20.20.1', 'TCP', '80'),
('10.10.10.1', '30.30.30.1', 'TCP', '80'),
('10.10.10.1', '40.40.40.1', 'TCP', '514'),
('10.10.10.1', '40.40.40.1', 'TCP', '443')
]

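So if you want to find the rows whose source, proto and service match (column indices 0, 2 and 3), you can build a key from those columns.

Let's write a small closure to extract those columns (it will be your keyfunc):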

def getcolumns(cols):
    def f(row):
        return tuple(row[i] for i in cols)
    return f
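
To see what the closure returns for a single row, a quick check along these lines may help. The comparison with `operator.itemgetter` is an aside that is not part of the original answer; it behaves the same way when given two or more indices (with a single index it returns a bare value rather than a tuple):

```python
from operator import itemgetter

def getcolumns(cols):
    def f(row):
        return tuple(row[i] for i in cols)
    return f

row = ('10.10.10.1', '20.20.20.1', 'TCP', '80')

# The closure remembers cols and picks those fields out of any row.
key = getcolumns((0, 2, 3))
closure_key = key(row)

# itemgetter with several indices builds the same tuple.
itemgetter_key = itemgetter(0, 2, 3)(row)

print(closure_key)     # ('10.10.10.1', 'TCP', '80')
print(itemgetter_key)  # ('10.10.10.1', 'TCP', '80')
```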

Let's see what you get:

>>> from itertools import groupby
>>> cols = (0,2,3)
>>> data.sort(key=getcolumns(cols))
>>> for k, v in groupby(data, getcolumns(cols)):
...     print(k, list(v))
...
('10.10.10.1', 'TCP', '443') [('10.10.10.1', '40.40.40.1', 'TCP', '443')]
('10.10.10.1', 'TCP', '514') [('10.10.10.1', '40.40.40.1', 'TCP', '514')]
('10.10.10.1', 'TCP', '80') [('10.10.10.1', '20.20.20.1', 'TCP', '80'), ('10.10.10.1', '30.30.30.1', 'TCP', '80')]

You may want to exclude results where the group has length 1 (no match):

>>> cols = (0,2,3)
>>> data.sort(key=getcolumns(cols))
>>> for k, v in groupby(data, getcolumns(cols)):
...     v = list(v)
...     if len(v) == 1: continue
...     print(k, v)
...
('10.10.10.1', 'TCP', '80') [('10.10.10.1', '20.20.20.1', 'TCP', '80'), ('10.10.10.1', '30.30.30.1', 'TCP', '80')]
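
If sorting the data is undesirable, the same grouping can be sketched with a dictionary instead of groupby. This is an alternative technique, not the approach the answer uses; the names `groups` and `matches` are illustrative:

```python
from collections import defaultdict

data = [
    ('10.10.10.1', '20.20.20.1', 'TCP', '80'),
    ('10.10.10.1', '30.30.30.1', 'TCP', '80'),
    ('10.10.10.1', '40.40.40.1', 'TCP', '514'),
    ('10.10.10.1', '40.40.40.1', 'TCP', '443'),
]
cols = (0, 2, 3)

# Collect rows under their key; no sorting needed because the dict
# finds the right bucket regardless of where the row appears.
groups = defaultdict(list)
for row in data:
    groups[tuple(row[i] for i in cols)].append(row)

# Keep only keys that matched more than one row.
matches = {k: rows for k, rows in groups.items() if len(rows) > 1}
print(matches)
```

Since dictionaries keep insertion order in Python 3.7+, the groups come out in the order their first row was seen.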

Now it only takes a little processing to turn this into the output format you are looking for:

>>> cols = (0,2,3)
>>> data.sort(key=getcolumns(cols))
>>> for k, v in groupby(data, getcolumns(cols)):
...     v = list(v)
...     if len(v) == 1: continue
...     print(*(';'.join(set(r[i] for r in v)) for i in range(len(v[0]))))
...
10.10.10.1 20.20.20.1;30.30.30.1 TCP 80

(If you want to use this implementation but preserve the order of the rows, use an OrderedSet in place of set.)
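
As a rough sketch of how the pieces could fit together for the format in the question, the version below swaps set for `dict.fromkeys` (which removes duplicates while keeping first-seen order, with no extra package) and joins IP addresses with a semicolon and services with a comma. The `SEPARATORS` tuple and the `merge_rows` helper are hypothetical names, and each call groups on one choice of columns only; it does not try to merge both ways at once:

```python
from itertools import groupby

data = [
    ('10.10.10.1', '20.20.20.1', 'TCP', '80'),
    ('10.10.10.1', '30.30.30.1', 'TCP', '80'),
    ('10.10.10.1', '40.40.40.1', 'TCP', '514'),
    ('10.10.10.1', '40.40.40.1', 'TCP', '443'),
]

# Assumed join string per column: source, dest, proto, service.
SEPARATORS = (';', ';', ',', ', ')

def merge_rows(rows, key_cols):
    """Group rows on key_cols and join the distinct values of every column."""
    def key(row):
        return tuple(row[i] for i in key_cols)

    merged = []
    for _, group in groupby(sorted(rows, key=key), key):
        group = list(group)
        if len(group) == 1:
            continue  # nothing to combine
        fields = []
        for i, sep in enumerate(SEPARATORS):
            # dict.fromkeys drops duplicates but keeps first-seen order.
            unique = dict.fromkeys(r[i] for r in group)
            fields.append(sep.join(unique))
        merged.append('{}\t{}\t{} {}'.format(*fields))
    return merged

by_dest = merge_rows(data, (0, 2, 3))     # same source, proto, service
by_service = merge_rows(data, (0, 1, 2))  # same source, dest, proto
print(by_dest)
print(by_service)
```

Writing the result to a file is then a matter of joining the returned lines with newlines.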