Efficiently combining similar CSV rows in Python

Time: 2017-08-25 16:50:52

Tags: python pandas csv dataframe

I want to combine similar rows in very large CSV files (nearly 1 GB each!) into single rows. I am interested in doing something like this:

Before

First Name | Last Name | Phone Number | Email

John       | Doe       | 1234         | john@doe.com
Jane       | Doe       | 4321         | jane@doe.com
John       | Doe       | 6789         | john@gmail.com
Jane       | Doe       | 9876         | jane@gmail.com

After

First Name | Last Name | Phone Number | Email

John       | Doe       | 1234, 6789   | john@doe.com, john@gmail.com
Jane       | Doe       | 4321, 9876   | jane@doe.com, jane@gmail.com

That is, rows should be combined by first name and last name, with the phone numbers and emails of each person collected into a "list".

Thanks.

2 Answers:

Answer 0 (score: 1)

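To read the CSV file, use pd.read_csv. Pass only one of sep or delimiter: they are aliases, and current pandas versions raise a ValueError when both are given. A regular-expression separator also strips the padding around each pipe:

 import pandas as pd

 # sep is a regex here, so the python parsing engine is required
 df = pd.read_csv('file.csv', sep=r'\s*\|\s*', engine='python')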
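
As a quick check of the reading step, here is a minimal sketch that parses a padded, pipe-delimited sample from memory. The `sample` buffer is a hypothetical stand-in for file.csv, and the regex separator is one possible way to handle the padding, not necessarily what the answer's author had in mind.

```python
import io
import pandas as pd

# Hypothetical in-memory sample standing in for file.csv
sample = io.StringIO(
    "First Name | Last Name | Phone Number | Email\n"
    "\n"
    "John       | Doe       | 1234         | john@doe.com\n"
    "Jane       | Doe       | 4321         | jane@doe.com\n"
)

# The regex consumes the whitespace on both sides of each '|';
# regex separators are only supported by the python engine.
df = pd.read_csv(sample, sep=r"\s*\|\s*", engine="python")
print(df.columns.tolist())
```

The python engine is slower than the default C engine, so for files near 1 GB it may be worth cleaning the padding beforehand and reading with a plain sep='|'.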

Then call df.groupby on First Name and Last Name, and pass ', '.join to DataFrameGroupBy.agg:

print(df)

    First Name    Last Name  Phone Number            Email
0  John          Doe                 1234     john@doe.com
1  Jane          Doe                 4321     jane@doe.com
2  John          Doe                 6789   john@gmail.com
3  Jane          Doe                 9876   jane@gmail.com


out = df.astype(str).groupby(['First Name', 'Last Name']).agg(', '.join)
print(out)

                        Phone Number                           Email
First Name  Last Name                                               
Jane         Doe          4321, 9876   jane@doe.com,  jane@gmail.com
John         Doe          1234, 6789   john@doe.com,  john@gmail.com

If you want to reset the index, you can do so with df.reset_index:
out = out.reset_index()
print(out)

    First Name    Last Name Phone Number                           Email
0  Jane          Doe          4321, 9876   jane@doe.com,  jane@gmail.com
1  John          Doe          1234, 6789   john@doe.com,  john@gmail.com

Saving to CSV is simple: use out.to_csv('file.csv'), adding index=False if you do not want the row index written as a column.

Addendum: dropping duplicates

out = df.astype(str).groupby(['First Name', 'Last Name'])\
                .agg(lambda x: ', '.join(x.drop_duplicates().values))
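
To illustrate the effect of the duplicate-dropping variant, here is a small sketch on a hypothetical frame in which one phone number appears twice for the same person. The sample data is made up for illustration.

```python
import pandas as pd

# Hypothetical sample: John's phone number 1234 is listed twice
df = pd.DataFrame({
    "First Name": ["John", "Jane", "John"],
    "Last Name": ["Doe", "Doe", "Doe"],
    "Phone Number": [1234, 4321, 1234],
    "Email": ["john@doe.com", "jane@doe.com", "john@gmail.com"],
})

out = (
    df.astype(str)
      .groupby(["First Name", "Last Name"])
      # drop_duplicates keeps the first occurrence, so order is preserved
      .agg(lambda x: ", ".join(x.drop_duplicates()))
)
print(out.loc[("John", "Doe"), "Phone Number"])  # 1234
print(out.loc[("John", "Doe"), "Email"])         # john@doe.com, john@gmail.com
```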

Answer 1 (score: 0)

For a CSV file that looks like this (reformatted slightly to remove the unnecessary whitespace):

First Name|Last Name|Phone Number|Email
John|Doe|1234|john@doe.com
Jane|Doe|4321|jane@doe.com
John|Doe|6789|john@gmail.com
Jane|Doe|9876|jane@gmail.com

you can use pandas as follows to combine similar rows (based on first and last name):

import pandas as pd

df = pd.read_csv("/tmp/test.csv", sep="|")
# Join every value in a group into one comma-separated string
def join_values(values):
    return ', '.join(str(v) for v in values)

df_combined = df.groupby(["First Name", "Last Name"], as_index=False).agg({"Phone Number": join_values, "Email": join_values})
df_combined.to_csv("/tmp/combined_data.csv", sep="|", index=False)

The output file looks like this:

First Name|Last Name|Phone Number|Email
Jane|Doe|4321, 9876|jane@doe.com, jane@gmail.com
John|Doe|1234, 6789|john@doe.com, john@gmail.com
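
Since the question mentions files of nearly 1 GB, one alternative worth considering is to stream the rows with the standard csv module instead of loading a whole DataFrame. The sketch below is only an illustration: `combine_rows` is a hypothetical helper, it assumes a clean pipe-delimited file like the one above, and it assumes the combined result (though not the raw file) fits in memory.

```python
import csv
import io
from collections import defaultdict

def combine_rows(src, dst, key_cols=("First Name", "Last Name"), sep="|"):
    """Read rows one at a time from src, group them by key_cols, write to dst."""
    reader = csv.DictReader(src, delimiter=sep)
    value_cols = [c for c in reader.fieldnames if c not in key_cols]
    # One list of collected values per non-key column, per distinct key
    groups = defaultdict(lambda: {c: [] for c in value_cols})
    for row in reader:
        key = tuple(row[c] for c in key_cols)
        for c in value_cols:
            groups[key][c].append(row[c])
    writer = csv.writer(dst, delimiter=sep, lineterminator="\n")
    writer.writerow(list(key_cols) + value_cols)
    for key, values in groups.items():
        writer.writerow(list(key) + [", ".join(values[c]) for c in value_cols])

# Usage on an in-memory sample; with real files, open both with newline=""
src = io.StringIO(
    "First Name|Last Name|Phone Number|Email\n"
    "John|Doe|1234|john@doe.com\n"
    "Jane|Doe|4321|jane@doe.com\n"
    "John|Doe|6789|john@gmail.com\n"
    "Jane|Doe|9876|jane@gmail.com\n"
)
dst = io.StringIO()
combine_rows(src, dst)
print(dst.getvalue())
```

Unlike the groupby version, the groups come out in order of first appearance rather than sorted by key.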