Column-wise diff of two CSV files

Time: 2015-02-13 00:57:23

Tags: python csv dictionary compare

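I am trying to compare two CSV files (shown below, with more files to come). I have tried many approaches, using lists, DictReader and more, but none of them gave me the output I need. I want to compare all rows that have the same !Sample_title and !Sample_geo_accession values (these columns sit at different positions in the two files). I have been struggling with this for three days and cannot find a solution. I would greatly appreciate any help.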

CSV1:

!Sample_title,!Sample_geo_accession,!Sample_status,!Sample_type,!Sample_source_name_ch1
body,GSM501443,Public on july 22 2010,ribonucleic acid,FB_50_12wk
foreign,GSM501445,Public on july 22 2010,ribonucleic acid,FB_0_12wk
HJCENV,GSM501446,Public on july 22 2010,ribonucleic acid,FB_50_12wk
AsDW,GSM501444,Public on july 22 2010,ribonucleic acid,FB_0_12wk

CSV2:

!Sample_title,!Sample_type,!Sample_source_name_ch1,!Sample_geo_accession
AsDW,ribonucleic acid,FB_0,GSM501444
foreign,ribonucleic acid,FB,GSM501449
HJCENV,RNA,12wk,GSM501446

Desired output (with respect to CSV2):

Added:

{!Sample_status:{HJCENV:Public on july 22 2010,AsDW:Public on july 22 2010}} #Added columns, not rows.

Deleted:

{} #Since nothing's deleted with respect to CSV2

Changed:

{!Sample_title:AsDW,!Sample_source_name_ch1:(FB_0_12wk,FB_0),!Sample_geo_accession:GSM501444
!Sample_title:HJCENV,!Sample_type:(ribonucleic acid,RNA),!Sample_source_name_ch1:(FB_50_12wk,12wk),!Sample_geo_accession:GSM501446}
#foreign,ribonucleic acid,FB,GSM501449 doesn't come here since the !Sample_geo_accession column values didn't match. 

Edit:

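The Added dictionary should give, for each !Sample_title in CSV1 (where !Sample_title and !Sample_geo_accession match between CSV1 and CSV2), any additional columns and their values (if CSV1 has more columns than CSV2).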

The Deleted dictionary is similar to Added, except that it looks for deleted columns.

Changed gives the values that differ between the files, together with their headers.

So basically it should compare apples to apples (when the header names match), not apples to oranges (by column position).

1 answer:

Answer 0: (score: 1)

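Your question is still very unclear. First we have to decode what is being asked. You say "diff two CSV files", which usually means a row-wise difference, possibly after first reordering the rows by the index columns ['!Sample_title', '!Sample_geo_accession'].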

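But what you actually want is a column-wise difference. Specifically, you want to know which columns were added in csv2, which columns were deleted, and, for the common columns, which entries (rows) changed in csv2. Now, do you want these differences computed and displayed per individual series, or across all columns at once?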

Something like this:

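import pandas as pd
pd.options.display.width = 200

# Index both frames on the key columns so rows are matched by value, not by position
df1 = pd.read_csv('1.csv', index_col=['!Sample_title', '!Sample_geo_accession'])
df2 = pd.read_csv('2.csv', index_col=['!Sample_title', '!Sample_geo_accession'])

# Set operations on the column labels (the old `&` and `-` Index operators
# and the `.ix` indexer no longer work in current pandas)
cols_common  = df1.columns.intersection(df2.columns).tolist()
cols_added   = df2.columns.difference(df1.columns).tolist()
cols_deleted = df1.columns.difference(df2.columns).tolist()

print("\nAdded",   df2.loc[:, cols_added])
print("\nDeleted", df1.loc[:, cols_deleted])
print("\nChanged", df2.loc[:, cols_common])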

Output:

Added:
[(AsDW, GSM501444), (foreign, GSM501449), (HJCENV, GSM501446)]

Deleted                                              !Sample_status
!Sample_title !Sample_geo_accession                        
body          GSM501443              Public on july 22 2010
foreign       GSM501445              Public on july 22 2010
HJCENV        GSM501446              Public on july 22 2010
AsDW          GSM501444              Public on july 22 2010

Changed                                          !Sample_type !Sample_source_name_ch1
!Sample_title !Sample_geo_accession                                          
AsDW          GSM501444              ribonucleic acid                    FB_0
foreign       GSM501449              ribonucleic acid                      FB
HJCENV        GSM501446                           RNA                    12wk
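Note that the output above lists every row of each file, including rows whose key pair exists in only one of them. If only rows whose (!Sample_title, !Sample_geo_accession) pair appears in both files should count, which is my reading of the question rather than something it states precisely, the restriction could be sketched roughly like this. The sample data is inlined through `io.StringIO` in place of `1.csv` and `2.csv`, and the names `KEYS`, `shared`, `only_in_1` and `status_for_shared` are made up for illustration.

```python
import io
import pandas as pd

KEYS = ['!Sample_title', '!Sample_geo_accession']

# Inline copies of the two sample files from the question
CSV1 = """!Sample_title,!Sample_geo_accession,!Sample_status,!Sample_type,!Sample_source_name_ch1
body,GSM501443,Public on july 22 2010,ribonucleic acid,FB_50_12wk
foreign,GSM501445,Public on july 22 2010,ribonucleic acid,FB_0_12wk
HJCENV,GSM501446,Public on july 22 2010,ribonucleic acid,FB_50_12wk
AsDW,GSM501444,Public on july 22 2010,ribonucleic acid,FB_0_12wk
"""
CSV2 = """!Sample_title,!Sample_type,!Sample_source_name_ch1,!Sample_geo_accession
AsDW,ribonucleic acid,FB_0,GSM501444
foreign,ribonucleic acid,FB,GSM501449
HJCENV,RNA,12wk,GSM501446
"""

df1 = pd.read_csv(io.StringIO(CSV1), index_col=KEYS)
df2 = pd.read_csv(io.StringIO(CSV2), index_col=KEYS)

# Key pairs that occur in both files
shared = df1.index.intersection(df2.index)

# Columns that exist only in CSV1
only_in_1 = df1.columns.difference(df2.columns)

# Values of those columns, restricted to the rows both files share
status_for_shared = df1.reindex(shared)[only_in_1]
print(status_for_shared)
```

With the sample data, `foreign` drops out because its accession differs between the files (GSM501445 vs GSM501449), and `body` drops out because it is missing from CSV2.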

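It seems you also need us to reorder the columns so that df1 and df2 are in the same order. But you still haven't told us how '!Sample_source_name_ch1' should be compared, since 'FB_0_12wk' != '12wk'.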
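If plain string inequality is what you mean (an assumption on my part, the question does not say), the changed cells for matching rows could be collected along these lines. This sketch uses only the standard library `csv.DictReader`, which looks values up by header name and so sidesteps the column-order problem. `index_rows` and `changed_cells` are hypothetical helper names, and the sample data is inlined instead of being read from files.

```python
import csv
import io

KEYS = ('!Sample_title', '!Sample_geo_accession')

# Inline copies of the two sample files from the question
CSV1 = """!Sample_title,!Sample_geo_accession,!Sample_status,!Sample_type,!Sample_source_name_ch1
body,GSM501443,Public on july 22 2010,ribonucleic acid,FB_50_12wk
foreign,GSM501445,Public on july 22 2010,ribonucleic acid,FB_0_12wk
HJCENV,GSM501446,Public on july 22 2010,ribonucleic acid,FB_50_12wk
AsDW,GSM501444,Public on july 22 2010,ribonucleic acid,FB_0_12wk
"""
CSV2 = """!Sample_title,!Sample_type,!Sample_source_name_ch1,!Sample_geo_accession
AsDW,ribonucleic acid,FB_0,GSM501444
foreign,ribonucleic acid,FB,GSM501449
HJCENV,RNA,12wk,GSM501446
"""

def index_rows(text):
    """Map (title, accession) -> row dict, so rows are matched by key, not position."""
    reader = csv.DictReader(io.StringIO(text))
    return {tuple(row[k] for k in KEYS): row for row in reader}

def changed_cells(text1, text2):
    """For key pairs present in both files, collect (csv1 value, csv2 value)
    for every shared non-key column whose values differ as plain strings."""
    rows1, rows2 = index_rows(text1), index_rows(text2)
    changed = {}
    for key in rows1.keys() & rows2.keys():
        r1, r2 = rows1[key], rows2[key]
        shared_cols = [c for c in r1 if c in r2 and c not in KEYS]
        diffs = {c: (r1[c], r2[c]) for c in shared_cols if r1[c] != r2[c]}
        if diffs:
            changed[key] = diffs
    return changed

result = changed_cells(CSV1, CSV2)
for key in sorted(result):
    print(key, result[key])
```

Any other notion of "changed" (for example, treating 'FB_0' as a prefix match of 'FB_0_12wk') would need a different comparison in place of `r1[c] != r2[c]`.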

I won't go any further with this until you clarify what you are asking for.