Group by certain columns, then sum a CSV

Time: 2019-01-29 13:19:47

Tags: python python-3.x csv

I have data in a CSV file that I need to parse. It looks like this:

Date,Tag,Amount
13/06/2018,ABC,6750000
13/06/2018,ABC,159800
24/05/2018,ABC,-1848920
16/05/2018,AB,-1829700
16/05/2018,AB,3600000
28/06/2018,A,15938000
16/05/2018,AB,3748998
28/06/2018,A,1035000
28/06/2018,A,1035000
14/06/2018,ABC,2122717

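As you can see, each date has a tag and a number next to it. What I want to do is use the date and the tag as keys, group the rows by date and tag, and sum the amounts.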

Expected result

Date,Tag,Amount
13/06/2018,ABC,5220680
16/05/2018,AB,5519298
28/06/2018,A,18008000
14/06/2018,ABC,2122717

The code I am using now is below, and it does not work properly.

from collections import defaultdict
import csv

d = defaultdict(int)

with open("file.csv") as f:
    for line in f:
        tokens = [t.strip() for t in line.split(",")]
        try:
            date = int(tokens[0])
            tag = int(tokens[1])
            amount = int(tokens[2])
        except ValueError:
            continue
        d[date] += amount

print d

Can someone tell me how to do this without using pandas?

3 answers:

Answer 0 (score: 1)

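You should definitely use pandas. Rather than writing the code yourself, you just install the pandas module, import it (import pandas as pd), and solve this in 2 simple, intuitive lines of code: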

>>> df = pd.read_csv('file.csv')
>>> df.groupby(['Date', 'Tag']).Amount.sum()

Date        Tag
13/06/2018  ABC     6909800
14/06/2018  ABC     2122717
16/05/2018  AB      5519298
24/05/2018  ABC    -1848920
28/06/2018  A      18008000

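If you really need to write the code yourself, you can use a nested defaultdict, which gives you a two-level groupby. Also, why are you trying to convert date and tag to int? That makes no sense. Just remove those conversions.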

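from collections import defaultdict

# Outer key: date, inner key: tag, value: running total of Amount
d = defaultdict(lambda: defaultdict(int))

with open("file.csv") as f:
    for line in f:
        tokens = [t.strip() for t in line.split(",")]
        try:
            date = tokens[0]
            tag = tokens[1]
            amount = int(tokens[2])
        except (ValueError, IndexError):
            # Skips the header row (non-numeric Amount) and blank or short lines
            continue
        d[date][tag] += amount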

The output is:

13/06/2018 ABC 6909800
24/05/2018 ABC -1848920
16/05/2018 AB 5519298
28/06/2018 A 18008000
14/06/2018 ABC 2122717

To print the result above, just iterate over the items:

for k,v in d.items():
    for k2, v2 in v.items():
        print(k,k2,v2)
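
On the point about converting date and tag to int, here is a small sketch of why the question's loop sums nothing. `int()` raises `ValueError` for strings such as `13/06/2018` or `ABC`, so the `except ValueError: continue` branch skips every row before any amount is added. The helper name `try_int` is made up for this illustration.

```python
def try_int(text):
    """Return int(text), or None when text is not an integer literal."""
    try:
        return int(text)
    except ValueError:
        return None

# One data row from the question, split the same way the question's code does
row = "13/06/2018,ABC,6750000".split(",")
converted = [try_int(token) for token in row]
print(converted)  # [None, None, 6750000]
```

Only the third column converts, which is why the date and tag should stay as strings.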

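To make your code better, read the first line on its own, then iterate from the second line to the end. That way you can drop the try/except and end up with simpler, cleaner code. But you can take it from here, right? ;)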
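
A minimal sketch of that suggestion, assuming every line after the header is a well-formed `Date,Tag,Amount` row (no blank or short lines). To keep it self-contained, an `io.StringIO` object with a few rows from the question stands in for `open("file.csv")`; the names `sample` and `totals` are made up for this illustration.

```python
from collections import defaultdict
import io

# Stand-in for open("file.csv"); rows copied from the start of the question's data
sample = io.StringIO(
    "Date,Tag,Amount\n"
    "13/06/2018,ABC,6750000\n"
    "13/06/2018,ABC,159800\n"
    "24/05/2018,ABC,-1848920\n"
    "16/05/2018,AB,-1829700\n"
    "16/05/2018,AB,3600000\n"
)

totals = defaultdict(lambda: defaultdict(int))

header = next(sample)  # consume the header so the loop only sees data rows
for line in sample:
    date, tag, amount = (t.strip() for t in line.split(","))
    totals[date][tag] += int(amount)

for date, tags in totals.items():
    for tag, total in tags.items():
        print(date, tag, total)
```

With a real file, replace `sample` with the file object from `with open("file.csv") as f:`. If the file may contain blank or malformed lines, the try/except is still the safer choice.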

To write to csv, simply

s = '\n'.join(['{0},{1},{2}'.format(k, k2, v2) for k,v in d.items() for k2,v2 in v.items()])
with open('output.csv', 'w') as f:
    f.write(s)

Answer 1 (score: 0)

Here is one approach using simple iteration.

For example:

from collections import defaultdict
import csv

result = defaultdict(int)
with open(filename) as infile:
    reader = csv.reader(infile)
    header = next(reader)
    for line in reader:
        result[tuple(line[:2])] += int(line[2])

print(header)
for k, v in result.items():
    print(k[0], k[1], v)

Output:

14/06/2018 ABC 2122717
13/06/2018 ABC 6909800
28/06/2018 A 18008000
16/05/2018 AB 5519298
24/05/2018 ABC -1848920

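To CSV

# In Python 3, open in text mode with newline="" for csv.writer ("wb" is the Python 2 form).
# Use a different path than the input file, or the source data will be overwritten.
with open(filename, "w", newline="") as outfile:
    writer = csv.writer(outfile)
    writer.writerow(header)
    for k, v in result.items():
        writer.writerow([k[0], k[1], v])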

Answer 2 (score: 0)

You can use itertools.groupby

from itertools import groupby 
import csv
header, *data = csv.reader(open('filename.csv'))
new_data = [[a, list(b)] for a, b in groupby(sorted(data, key=lambda x:x[:2]), key=lambda x:x[:2])]
results = [[*a, sum(int(c) for *_, c in b)] for a, b in new_data]
with open('calc_results.csv', 'w', newline='') as f:
  write = csv.writer(f)
  write.writerows([header, *results])

Output:

Date,Tag,Amount
13/06/2018,ABC,6909800
14/06/2018,ABC,2122717
16/05/2018,AB,5519298
24/05/2018,ABC,-1848920
28/06/2018,A,18008000
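
One detail worth spelling out: `itertools.groupby` only merges adjacent items that share a key, which is why the code above sorts the rows by `(Date, Tag)` before grouping. The sketch below shows the difference on three rows taken from the question; the names `date_tag` and `sum_groups` are made up for this illustration.

```python
from itertools import groupby

# Three rows from the question; the two 16/05/2018 AB rows are not adjacent
rows = [
    ["16/05/2018", "AB", "-1829700"],
    ["28/06/2018", "A", "15938000"],
    ["16/05/2018", "AB", "3748998"],
]

def date_tag(row):
    """Grouping key: the first two columns (Date, Tag)."""
    return row[:2]

def sum_groups(data):
    """Sum the Amount column for each run of consecutive rows with the same key."""
    return [[*k, sum(int(r[2]) for r in g)] for k, g in groupby(data, key=date_tag)]

unsorted_result = sum_groups(rows)                        # AB appears twice
sorted_result = sum_groups(sorted(rows, key=date_tag))    # AB merged into one row
print(unsorted_result)
print(sorted_result)
```

Note that the sort compares the dates as strings, so the output order is lexicographic by `dd/mm/yyyy` text rather than chronological. That does not affect the sums.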