If an element in another column is in the dictionary

Time: 2017-04-18 06:30:25

Tags: python file dictionary

I have a few problems I want to solve:

  1. After each IP address is stored in the dictionary frequency4, I want to check it element by element: whenever that IP address appears in column[4] of a data line in the text file, keep adding that exact IP's byte count from the data file.

  2. If the Bytes value under column[8] contains an "M" (meaning million), it should convert that M into '* 1000000', so 3.3 M equals 3300000 (see the data in the text file below). Keep in mind this is a sample of the text file; the real file contains thousands of lines of data.

  3. The output I am looking for is:

    Total bytes for ip 172.217.9.133 is 3300000
    Total bytes for ip 205.251.24.253 is 9516
    Total bytes for ip 52.197.234.56 is 14546
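
The "M" conversion described in item 2 can be factored into a tiny helper. This is only a sketch; `parse_bytes` and its parameters are hypothetical names, not part of the original code:

```python
def parse_bytes(value, unit=None):
    # Convert a Bytes field to an int; a trailing 'M' token means millions.
    # round() guards against float truncation: int(3.3 * 10**6) would give 3299999.
    multiplier = 1_000_000 if unit == 'M' else 1
    return round(float(value) * multiplier)

print(parse_bytes('3.3', 'M'))  # 3300000
print(parse_bytes('4758'))      # 4758
```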
    

    CODE

    from collections import OrderedDict
    from collections import Counter
    
    frequency4 = Counter({})
    ttlbytes = 0
    
    
    with open('/Users/rm/Desktop/nettestWsum.txt', 'r') as infile:    
        next(infile) 
        for line in infile:       
            if "Summary:" in line:
                break
            try:               
                srcip = line.split()[4].rsplit(':', 1)[0]
                frequency4[srcip] = frequency4.get(srcip,0) + 1 
                f4 = OrderedDict(frequency4.most_common())
                for srcip in f4:
                    ttlbytes += int(line.split()[8])
            except(ValueError):
                pass 
    print("\nTotal bytes for ip",srcip, "is:", ttlbytes)      
    for srcip, count in f4.items():    
        print("\nIP address from destination:", srcip, "was found:", count, "times.")
    

    DATA FILE

    Date first seen          Duration Proto      Src IP Addr:Port          Dst IP Addr:Port   Packets    Bytes Flows
    2017-04-11 07:23:17.880   929.748 UDP      172.217.9.133:443   ->  205.166.231.250:41138     3019    3.3 M     1
    2017-04-11 07:38:40.994     6.676 TCP     205.251.24.253:443   ->  205.166.231.250:24723       16     4758     1
    2017-04-11 07:38:40.994     6.676 TCP     205.251.24.253:443   ->  205.166.231.250:24723       16     4758     1
    2017-04-11 07:38:41.258     6.508 TCP      52.197.234.56:443   ->  205.166.231.250:13712       14     7273     1
    2017-04-11 07:38:41.258     6.508 TCP      52.197.234.56:443   ->  205.166.231.250:13712       14     7273     1
    Summary: total flows: 22709, total bytes: 300760728, total packets: 477467, avg bps: 1336661, avg pps: 265, avg bpp: 629
    Time window: 2017-04-11 07:13:47 - 2017-04-11 07:43:47
    Total flows processed: 22709, Blocks skipped: 0, Bytes read: 1544328
    Sys: 0.372s flows/second: 61045.7    Wall: 0.374s flows/second: 60574.9
    

2 answers:

Answer 0 (score: 1)

I don't know what you need the frequency for, but given your input, here is how to get the desired output:

from collections import Counter

count = Counter()

with open('/Users/rm/Desktop/nettestWsum.txt', 'r') as infile:   
    next(infile)
    for line in infile:      
        if "Summary:" in line:
            break

        parts = line.split()
        srcip = parts[4].rsplit(':', 1)[0]

        multiplier = 10**6 if parts[9] == 'M' else 1
        # round() avoids float truncation: int(3.3 * 10**6) would give 3299999
        bytes = round(float(parts[8]) * multiplier)
        count[srcip] += bytes

for srcip, bytes in count.most_common():
    print('Total bytes for ip', srcip, 'is', bytes)
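
If you also need the per-IP occurrence counts that the question's frequency4 was meant to track, the same loop can feed two Counters. A sketch (the in-memory `sample` list and the `byte_totals`/`occurrences` names are mine, standing in for the file):

```python
from collections import Counter

# Sample lines in the same format as the data file
sample = [
    "2017-04-11 07:23:17.880   929.748 UDP      172.217.9.133:443   ->  205.166.231.250:41138     3019    3.3 M     1",
    "2017-04-11 07:38:40.994     6.676 TCP     205.251.24.253:443   ->  205.166.231.250:24723       16     4758     1",
    "2017-04-11 07:38:40.994     6.676 TCP     205.251.24.253:443   ->  205.166.231.250:24723       16     4758     1",
]

byte_totals = Counter()   # bytes summed per source IP
occurrences = Counter()   # how often each source IP appears

for line in sample:
    parts = line.split()
    srcip = parts[4].rsplit(':', 1)[0]
    # token after the Bytes value is 'M' only for the scaled rows
    multiplier = 10**6 if parts[9] == 'M' else 1
    byte_totals[srcip] += round(float(parts[8]) * multiplier)
    occurrences[srcip] += 1

for srcip, total in byte_totals.most_common():
    print('Total bytes for ip', srcip, 'is', total)
for srcip, count in occurrences.items():
    print('IP address from destination:', srcip, 'was found:', count, 'times.')
```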

Answer 1 (score: 0)

OK, I'm not sure whether you need to edit the same file. If you just want to process the data and look at it, you can explore it with pandas, since it has many features that speed up data processing.

import pandas as pd
# skipfooter requires the python parser engine; the separator is a
# raw-string regex matching runs of two or more spaces
df = pd.read_csv(filepath_or_buffer = '/Users/rm/Desktop/nettestWsum.txt',
                 index_col = False, header = None, skiprows = 1,
                 sep = r'\s\s+', engine = 'python', skipfooter = 4)
df.drop(labels = 3, axis = 1, inplace = True)
# To drop the -> column
columnnames = 'Date first seen,Duration Proto,Src IP Addr:Port,Dst IP Addr:Port,Packets,Bytes,Flows'
columnnames = columnnames.split(',')
df.columns = columnnames

This loads the data into a nice dataframe (table). I suggest reading the documentation of the pandas.read_csv method here. To process the data, you can try the following.

# converting data with 'M' to numeric data in millions;
# round() avoids float truncation when scaling, e.g. 3.3 * 1000000
df['Bytes'] = df['Bytes'].apply(lambda x: round(float(x[:-2])*1000000) if x[-1] == 'M' else x)
df['Bytes'] = pd.to_numeric(df['Bytes'])
result = df.groupby(by = 'Dst IP Addr:Port').sum()

Your data ends up in a nice dataframe (table) you can work with. I suspect it is faster than looping; you can test that separately. Below is what the data looks like after loading.

[screenshot: the loaded DataFrame]

Here is the output of the groupby, which you can tweak. I am using the Spyder IDE, and the screengrab is from the Variable Explorer in the IDE. You can display the result by printing the dataframe or saving it as another CSV.

[screenshot: groupby result in Spyder's Variable Explorer]
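
To get the exact lines the question asks for, you could group by the source IP instead (the question sums by Src IP, not Dst, and wants the `:port` suffix stripped). A sketch with a small hand-built frame standing in for the parsed file; the `'Src IP'` column name is my own addition:

```python
import pandas as pd

# Hypothetical rows standing in for the parsed data file
df = pd.DataFrame({
    'Src IP Addr:Port': ['172.217.9.133:443', '205.251.24.253:443', '205.251.24.253:443'],
    'Bytes': [3300000, 4758, 4758],
})

# Strip the :port suffix so flows from the same IP group together
df['Src IP'] = df['Src IP Addr:Port'].str.rsplit(':', n=1).str[0]
result = df.groupby('Src IP')['Bytes'].sum()

for ip, total in result.items():
    print('Total bytes for ip', ip, 'is', total)
```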