Question

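I want to compare multiple rows in a list that holds the source IP, destination IP, packet time and packet size. I would like to merge the data across all rows that share the same source and destination IP. For example, if two or more rows have the same source and destination IP, how can I combine all of their data? I don't want to compare only the first and second rows; I want to match every row in the list with the same 172.217.2.161 (source) and 10.247.15.39 (destination), then extract the first and last timestamps into a new list.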

def combine_data(source, dest, time, length):
    CombinePacket = [(source[i], dest[i], time[i], length[i]) for i in range(len(source))]
    NewData = []
    TotalSize = 0

    for i, j in zip(CombinePacket, CombinePacket[1:]):
        if(i[0:2] == j[0:2]):
            TotalSize = TotalSize + int(i[3])+int(j[3])
            data = i[0], i[1], i[2], j[2], TotalSize
            NewData.append(data)

The list contains:

[(['172.217.2.161'], ['10.247.15.39'], '13:25:31.044180', 46)]
[(['172.217.2.161'], ['10.247.15.39'], '13:25:31.044190', 29)]
[(['172.217.2.161'], ['10.247.15.39'], '13:25:31.044200', 50)]

The output should be:

[['172.217.2.161'], ['10.247.15.39'],'13:25:31.044180', '13:25:31.044200', 125]

Answer 1

You can use itertools.groupby for this kind of task:

from __future__ import print_function

import itertools


def key(packet):
    return packet[0], packet[1]  # source and destination


def do_combine_data(sources, destinations, times, lengths):
    packets = zip(sources, destinations, times, lengths)

    for (packet_source, packet_dest), group in itertools.groupby(
            sorted(packets, key=key), key=key):
        group = list(group)
        packet_sizes = [packet_size for (_, _, _, packet_size) in group]
        packet_times = [at for (_, _, at, _) in group]

        start_time, end_time = [func(packet_times) for func in (min, max)]
        total_size = sum(packet_sizes)

        yield packet_source, packet_dest, start_time, end_time, total_size

After that, you can use it however you need (even wrapping source and destination in their own lists):

def combine_data(source, dest, time, length):
    return [
        ([[s], [d], b, e, t])
        for s, d, b, e, t in do_combine_data(source, dest, time, length)]


def main():
    sources = ["a", "a", "a", "a", "a"]
    destinations = ["b", "b", "b", "c", "c"]
    times = ["1", "2", "5", "3", "4"]
    lengths = [12, 11, 51, 89, 17]
    print(combine_data(sources, destinations, times, lengths))


if __name__ == '__main__':
    main()

The output will be:

[[['a'], ['b'], '1', '5', 74], [['a'], ['c'], '3', '4', 106]]
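
Note that itertools.groupby only merges consecutive items with equal keys, which is why the packets are sorted with the same key before grouping. A minimal sketch of the difference, using made-up (source, destination) pairs rather than real packet data:

```python
import itertools

# Hypothetical (source, destination) pairs; the equal pairs are not adjacent.
pairs = [("a", "b"), ("a", "c"), ("a", "b")]

# Without sorting, the two ("a", "b") entries end up in separate groups.
unsorted_keys = [key for key, _ in itertools.groupby(pairs)]

# Sorting first makes equal pairs adjacent, so each key appears only once.
sorted_keys = [key for key, _ in itertools.groupby(sorted(pairs))]

print(unsorted_keys)  # [('a', 'b'), ('a', 'c'), ('a', 'b')]
print(sorted_keys)    # [('a', 'b'), ('a', 'c')]
```

If the input order must be preserved, a dictionary keyed by the pair is probably a better fit than sorting.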

Answer 2

Keep a dictionary and update its values as you go, then convert it to a list. Suppose you have a list like this:

data = [[(['172.217.2.161'], ['10.247.15.39'], '13:25:31.044180', 46)],
 [(['172.217.2.161'], ['10.247.15.39'], '13:25:31.044190', 29)],
 [(['172.217.2.161'], ['10.247.15.39'], '13:25:31.044200', 50)]]

Then:

d = dict()
for dat in data:
    # each element is a one-item list wrapping (source, dest, time, size)
    sourceIp = dat[0][0][0]
    destIp = dat[0][1][0]
    ts = dat[0][2]
    count = dat[0][3]
    k = (sourceIp, destIp)
    if k not in d:
        # first packet for this pair: its timestamp is both the min and the max
        d[k] = (ts, ts, count)
    else:
        val = d[k]
        d[k] = (min(ts, val[0]), max(ts, val[1]), count + val[2])


output = [[[k[0]], [k[1]], v[0], v[1], v[2]] for (k, v) in d.items()]

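Of course, you can build this dictionary directly instead of building the list first, which avoids the intermediate list. Also, unless you really need them, I recommend not using singleton lists for the IPs, since they only make the indexing confusing.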
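
As a rough sketch of that suggestion, the dictionary can be filled straight from the four parallel lists in the question, with the IPs kept as plain strings. The function name combine_packets is made up for illustration, and the sketch assumes the timestamps are fixed-width strings so that comparing them as text gives the right order:

```python
def combine_packets(sources, destinations, times, lengths):
    """Merge packets per (source, destination) pair in a single pass."""
    merged = {}
    for src, dst, ts, size in zip(sources, destinations, times, lengths):
        key = (src, dst)
        if key not in merged:
            # First packet for this pair: one timestamp is both first and last.
            merged[key] = (ts, ts, size)
        else:
            first, last, total = merged[key]
            merged[key] = (min(first, ts), max(last, ts), total + size)
    return [[src, dst, first, last, total]
            for (src, dst), (first, last, total) in merged.items()]


# Sample values taken from the question, with IPs as plain strings (assumption).
sources = ["172.217.2.161"] * 3
destinations = ["10.247.15.39"] * 3
times = ["13:25:31.044180", "13:25:31.044190", "13:25:31.044200"]
lengths = [46, 29, 50]

result = combine_packets(sources, destinations, times, lengths)
print(result)
# [['172.217.2.161', '10.247.15.39', '13:25:31.044180', '13:25:31.044200', 125]]
```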

Answer 3

Here is my idea:

data = [
(['172.217.2.161'], ['10.247.15.39'], '13:25:31.044180', 46),
(['172.217.2.161'], ['10.247.15.39'], '13:25:31.044190', 29),
(['172.217.2.161'], ['10.247.15.39'], '13:25:31.044200', 50)
]
source = [d[0] for d in data]
dest = [d[1] for d in data]
time = [d[2] for d in data]
length = [d[3] for d in data]

from collections import defaultdict


def combine_data(source, dest, time, length):
    CombinePacket = [(source[i], dest[i], time[i], length[i]) for i in range(len(source))]

    # group (time, length) pairs by (source IP, destination IP)
    data = defaultdict(list)
    for package in CombinePacket:
        data[(package[0][0], package[1][0])].append((package[2], package[3]))

    result = []
    for key, value in data.items():
        # sort by timestamp so the first and last entries are the earliest and latest
        value = sorted(value, key=lambda x: x[0])
        first_time = value[0][0]
        last_time = value[-1][0]
        sum_length = sum(v[1] for v in value)
        result.append([key[0], key[1], first_time, last_time, sum_length])

    return result

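Store the data in a dictionary keyed by (source, dest), then sort each group by time to get the first and last timestamps; the total size is the sum of all the sizes in that group.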
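
Sorting the timestamps as strings works for the sample data because every value has the same zero-padded width. If that cannot be guaranteed, one option is to compare parsed times instead. The sketch below assumes the HH:MM:SS.ffffff layout seen in the question; the helper parse_time and the non-padded sample value are made up for illustration:

```python
from datetime import datetime

TIME_FORMAT = "%H:%M:%S.%f"  # assumed layout, taken from the sample data


def parse_time(text):
    """Turn a timestamp string into a datetime so it compares chronologically."""
    return datetime.strptime(text, TIME_FORMAT)


# Hypothetical timestamps; the second one is not zero-padded.
times = ["13:25:31.044200", "9:05:00.000001", "13:25:31.044180"]

# Plain string comparison would put "13:..." before "9:...".
first_as_text = min(times)

# Comparing the parsed values gives the chronological first and last.
first_time = min(times, key=parse_time)
last_time = max(times, key=parse_time)

print(first_as_text)  # 13:25:31.044180
print(first_time)     # 9:05:00.000001
print(last_time)      # 13:25:31.044200
```

The same key function could be passed to sorted in place of the lambda if the timestamps in a capture are not uniformly formatted.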

Compare and combine data in a list
