How to get a dictionary of aggregated values from a dictionary that stores a CSV file

Time: 2018-08-03 18:01:00

Tags: python dictionary nested aggregate summary

I have a data file with the following content:

  Part#1
         A 10 20 10 10 30 10 20 10 30 10 20
         B 10 10 20 10 10 30 10 30 10 20 30
  Part#2
         A 30 30 30 10 10 20 20 20 10 10 10
         B 10 10 20 10 10 30 10 30 10 30 10
  Part#3
         A 10 20 10 30 10 20 10 20 10 20 10
         B 10 10 20 20 20 30 10 10 20 20 30

From this, I want a dictionary of dictionaries that holds the summary data for each letter, so it would look like this:

dictionary = {'Part#1': {'A': {10: 6, 20: 3, 30: 2},
                         'B': {10: 6, 20: 2, 30: 3}},
              'Part#2': {'A': {10: 5, 20: 3, 30: 3},
                         'B': {10: 7, 20: 1, 30: 3}},
              'Part#3': {'A': {10: 6, 20: 4, 30: 1},
                         'B': {10: 4, 20: 5, 30: 2}}}

If I want to display each part, it should give me output like this:

dictionary['Part#1']

A
 10: 6
 20: 3
 30: 2

B
 10: 6
 20: 2
 30: 3

…and so on for the next partitions in the file.
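The display format described above can be sketched with a small helper. This is only an illustration: `format_part` is a hypothetical name, and the snippet assumes the nested dictionary already exists with the shape shown earlier (here a hand-written sample for Part#1).

```python
def format_part(summary, part):
    """Return the per-letter counts of one part as display text."""
    lines = []
    for letter, counts in summary[part].items():
        lines.append(letter)
        for value in sorted(counts):
            lines.append(" %s: %s" % (value, counts[value]))
        lines.append("")  # blank line between letters
    return "\n".join(lines).rstrip()


# Hand-written sample matching Part#1 of the data file (assumed already built).
dictionary = {"Part#1": {"A": {10: 6, 20: 3, 30: 2},
                         "B": {10: 6, 20: 2, 30: 3}}}
print(format_part(dictionary, "Part#1"))
```

Returning a string instead of printing inside the helper keeps it easy to reuse for every part in a loop.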

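So far I have been able to parse the file from txt to csv and convert it into a dictionary, call it the outer dictionary. I have tried several ways of inspecting the output, and so far this code comes closest (though not all the way) to the structure I described above.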

partitions_dict = df.head(5).to_dict(orient='list')

print(partitions_dict)

Output:

{0: ['A', 'B', 'A', 'B', 'A'], 1: ['10', '10', '10', '10', '10'], 2: [10, 10, 10, 10, 10], 3: [10, 10, 10, 10, 10], 4: [10, 10, 10, 10, 10], 5: [10, 10, 10, 10, 10], 6: [10, 10, 10, 10, 10], 7: [10, 10, 10, 10, 10]

The function I use to parse the file:

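import csv
import os

def fileFormatConverter(txt_file):
    """ Receives a generated text file of partitions as a parameter
        and converts it into csv format.
        input: text file
        return: path of the csv file """

    filename, ext = os.path.splitext(txt_file)
    csv_file = filename + ".csv"
    # Open both files in a with-block so they are closed (and flushed) on exit;
    # newline="" is what the csv module expects for files it reads or writes.
    with open(txt_file, "r", newline="") as src, open(csv_file, "w", newline="") as dst:
        csv.writer(dst).writerows(csv.reader(src, delimiter=" "))
    return csv_file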

import pandas as pd

# skiprows=1 drops the first partition header line ("Part#1") from the dataframe
df_traces = pd.read_csv(fileFormatConverter("sample.txt"), skiprows=1, header=None)   #, error_bad_lines=False)
df_traces.head()

Output:

    0   1   2   3   4   5   6   7   8   9   ...     15  16  17  18  19  20  21  22  23  24
0   A,  10,     20,     10,     10,     30,     10,     20,     10,     30,     ...     20,     10,     10,     30,     10,     30,     10,     20,     30.0    NaN
1   Part#2  NaN     NaN     NaN     NaN     NaN     NaN     NaN     NaN     NaN     ...     NaN     NaN     NaN     NaN     NaN     NaN     NaN     NaN     NaN     NaN
2   A,  30,     30,     30,     10,     10,     20,     20,     20,     10,     ...     20,     10,     10,     30,     10,     30,     10,     30,     10.0    NaN
3   Part#3  NaN     NaN     NaN     NaN     NaN     NaN     NaN     NaN     NaN     ...     NaN     NaN     NaN     NaN     NaN     NaN     NaN     NaN     NaN     NaN
4   A,  10,     20,     10,     30,     10,     20,     10,     20,     10,     ...     20,     20,     20,     30,     10,     10,     20,     20,     30.0    NaN

I used a function to change the headers so that the letters within each partition are easier to work with:

def changeDFHeaders(df):

    df_transpose = df.T
    new_header = df_transpose.iloc[0]                       # stores the first row for the header
    df_transpose = df_transpose[1:]                         # take the data less the header row
    df_transpose.columns = new_header                       # set the header row as the df header
    return(df_transpose)


# The counter column serves as an index for the entire dataframe
#df_transpose['counter'] = range(len(df_transpose))      # adds the counter for rows column
#df_transpose.set_index('counter', inplace=True)
df_transpose_headers = changeDFHeaders(df_traces)
df_transpose_headers.infer_objects()

Output:

    A,  Part#2  A,  Part#3  A,
1   10,     NaN     30,     NaN     10,
2   20,     NaN     30,     NaN     20,
3   10,     NaN     30,     NaN     10,
4   10,     NaN     10,     NaN     30,
5   30,     NaN     10,     NaN     10,
6   10,     NaN     20,     NaN     20,
7   20,     NaN     20,     NaN     10,
8   10,     NaN     20,     NaN     20,
9   30,     NaN     10,     NaN     10,
10  10,     NaN     10,     NaN     20,
11  20,     NaN     10,     NaN     10,
12  B,  NaN     B,  NaN     B,
13  10,     NaN     10,     NaN     10,
14  10,     NaN     10,     NaN     10,
15  20,     NaN     20,     NaN     20,
16  10,     NaN     10,     NaN     20,
17  10,     NaN     10,     NaN     20,
18  30,     NaN     30,     NaN     30,
19  10,     NaN     10,     NaN     10,
20  30,     NaN     30,     NaN     10,
21  10,     NaN     10,     NaN     20,
22  20,     NaN     30,     NaN     20,
23  30  NaN     10  NaN     30
24  NaN     NaN     NaN     NaN     NaN

That is still not quite right, as you can see if you check the following statements:

df = df_transpose_headers
partitions_dict = df.head(5).to_dict(orient='list')      

print(partitions_dict) 

Output:

{'A,': ['10,', '20,', '10,', '30,', '10,'], 'Part#2': [nan, nan, nan, nan, nan], 'Part#3': [nan, nan, nan, nan, nan]}

2 Answers:

Answer 0: (score: 2)

I would avoid pandas, simply because I don't know it very well:

from collections import Counter

result = {}
part = ""
for line in f:  # f being an open file
    sline = line.strip()
    if not sline:
        continue  # skip blank lines
    if sline.startswith("Part"):
        part = sline
        result[part] = {}
        continue
    group, *values = sline.split()  # first token is the letter, the rest are values
    result[part][group] = Counter(values)

The result has the following form:

{'Part#1': {'A': Counter({'10': 6, '20': 3, '30': 2}), 'B': Counter({'10': 6, '30': 3, '20': 2})}, 
 'Part#2': {'A': Counter({'10': 5, '30': 3, '20': 3}), 'B': Counter({'10': 7, '30': 3, '20': 1})}, 
 'Part#3': {'A': Counter({'10': 6, '20': 4, '30': 1}), 'B': Counter({'20': 5, '10': 4, '30': 2})}}
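Note that the keys above are strings and the inner values are `Counter` objects, whereas the question shows integer keys in plain dictionaries. If that difference matters, a conversion along these lines might do; `to_int_keys` is a hypothetical helper name, and it assumes every value token is a valid integer.

```python
from collections import Counter


def to_int_keys(result):
    """Convert {part: {group: Counter}} into plain dicts keyed by int."""
    plain = {}
    for part, groups in result.items():
        plain[part] = {}
        for group, counts in groups.items():
            # int() raises ValueError if a token is not numeric
            plain[part][group] = {int(v): n for v, n in counts.items()}
    return plain


# Small sample built the same way as in the loop above.
sample = {"Part#1": {"A": Counter("10 20 10 10 30 10 20 10 30 10 20".split())}}
print(to_int_keys(sample))
```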

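If you start directly from a file with no line breaks, you can locate each section with "Part" and then use the index of "B" to separate the two groups of values: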

from collections import Counter

result = {}
sf = f.split("Part")[1:]  # f being the file contents as one string; drop the empty first piece
for chunk in sf:
    sline = chunk.split()  # split on whitespace (also discards leading/trailing spaces)
    part = "Part%s" % sline[0]
    b_index = sline.index("B")  # use the index of B to split the value lists
    result[part] = {}
    result[part][sline[1]] = Counter(sline[2:b_index])
    result[part]["B"] = Counter(sline[b_index + 1:])

Answer 1: (score: 0)

The input file is:

  Part#1
         A 10 20 10 10 30 10 20 10 30 10 20
         B 10 10 20 10 10 30 10 30 10 20 30
  Part#2
         A 30 30 30 10 10 20 20 20 10 10 10
         B 10 10 20 10 10 30 10 30 10 30 10
  Part#3
         A 10 20 10 30 10 20 10 20 10 20 10
         B 10 10 20 20 20 30 10 10 20 20 30

This should work:

def parse_file(file_name):
    return_dict = dict()
    section = str()
    with open(file_name, "r") as source:
        for line in source:
            tmp = line.split()
            if not tmp:
                continue  # skip blank lines
            if "#" in line:
                # a line such as "Part#1" starts a new section
                section = line.strip()
                return_dict[section] = dict()
                continue
            group = tmp.pop(0)  # first token is the letter, the rest are values
            counts = dict()
            return_dict[section][group] = counts
            for item in tmp:
                counts[item] = counts.get(item, 0) + 1

    return return_dict

Output:

{'Part#1': {'A': {'10': 6, '20': 3, '30': 2},
            'B': {'10': 6, '20': 2, '30': 3}},
 'Part#2': {'A': {'10': 5, '20': 3, '30': 3},
            'B': {'10': 7, '20': 1, '30': 3}},
 'Part#3': {'A': {'10': 6, '20': 4, '30': 1},
            'B': {'10': 4, '20': 5, '30': 2}}}

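Honestly, I don't see why the intermediate stage is needed: if you have to parse the file once to create the CSV anyway, you could just put the logic that creates the dict() there. So I apologize if I've missed a subtlety of the question.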

Edit: reworked the answer based on the comments, i.e. the input file is actually a single line.

So the input file is:

Part#1 A 10 20 10 10 30 10 20 10 30 10 20 B 10 10 20 10 10 30 10 30 10 20 30 Part#2 A 30 30 30 10 10 20 20 20 10 10 10 B 10 10 20 10 10 30 10 30 10 30 10 Part#3 A 10 20 10 30 10 20 10 20 10 20 10 B 10 10 20 20 20 30 10 10 20 20 30

The following modified code will work:

import string
from pprint import pprint

def parse_file2(file_name):
    return_dict = dict()
    section = None
    group = None
    with open(file_name, "r") as source:
        for line in source.readlines():
            tmp_line = line.strip().split()
            for token in tmp_line:
                if "#" in token:
                    section = token
                    return_dict[section] = dict()
                    continue
                elif token in string.ascii_uppercase:
                    group = token
                    return_dict[section][group] = dict()
                    continue
                if section and group:
                    if token in return_dict[section][group].keys():
                        return_dict[section][group][token] += 1
                    else:
                        return_dict[section][group][token] = 1

    return return_dict

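if __name__ == "__main__":
    file_name = "sample.txt"             # multi-line input (placeholder path)
    file_name2 = "sample_one_line.txt"   # single-line input (placeholder path)
    pprint(parse_file(file_name))
    pprint(parse_file2(file_name2))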

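Note that this function is written specifically for the file format mentioned in the comments. If the file is not in that format, it may blow up.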

Based on the question, though, this should work.
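If a clearer failure is preferred when the input does not match that format, a stricter variant could look roughly like this. It is only a sketch: `parse_tokens` is a hypothetical name, it works on a string rather than a file, and it assumes section tokens contain "#", group tokens are single uppercase letters, and values are non-negative integers.

```python
from collections import Counter


def parse_tokens(text):
    """Build {section: {group: Counter}} from whitespace-separated tokens."""
    result = {}
    section = None
    group = None
    for token in text.split():
        if "#" in token:
            # a new section resets the current group
            section, group = token, None
            result[section] = {}
        elif len(token) == 1 and token.isupper():
            if section is None:
                raise ValueError("group %r appears before any section" % token)
            group = token
            result[section][group] = Counter()
        elif token.isdigit():
            if group is None:
                raise ValueError("value %r appears before any group" % token)
            result[section][group][int(token)] += 1
        else:
            raise ValueError("unexpected token %r" % token)
    return result


data = "Part#1 A 10 20 10 B 30 30 Part#2 A 10"
print(parse_tokens(data))
```

Because it splits on any whitespace, the same function handles both the multi-line and the single-line layout.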

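Also, if you could simplify the question post above to state the actual file contents and the desired result, or just say "I have structure A and want to convert it to structure B", I will clean up all the history in this post and give a simpler answer.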

Hope this helps! :)