Parsing a complex CSV file

Time: 2014-08-13 14:03:24

Tags: python csv pandas

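I have a CSV file that maps each country to some value, but the problem is that it is not well formed: its header repeats the pattern Countries, Amount, Countries, Amount, ... (Here each Amount measures something different, e.g. suicide rate, alcohol consumption, etc. Note that data is missing for some countries.) See the input DataFrame, df_in.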

I would like to have the countries as the index and those 'Amount' values as the columns; see the output DataFrame, df_out.

import pandas as pd

# Note: mangle_dupe_cols was removed in pandas 2.0; without it,
# duplicate headers are renamed to Amount, Amount.1, ...
df_in = pd.read_csv('https://dl.dropboxusercontent.com/u/40513206/input.csv', sep=';', header=0, index_col=None,
                    na_values=[''], mangle_dupe_cols=False)

df_out = pd.read_csv('https://dl.dropboxusercontent.com/u/40513206/output.csv', sep=';', header=0, index_col=None,
                     na_values=[''], mangle_dupe_cols=False)

My first thought was to collect all unique countries from the input (for example, to use them as the index of a new, empty DataFrame):

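# Pick the columns whose name contains 'Countries'
col_pat = df_in.columns[df_in.columns.to_series().str.contains('Countries')]
cntry = df_in.loc[:, col_pat]
# Flatten to one array, de-duplicate, and drop the missing entries
un_elm = pd.Series(list(map(str, pd.unique(cntry.values.ravel()))))
countries = un_elm[un_elm != 'nan']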

Then I would split the main DataFrame into pieces (Countries as the index, Amount as the column) and accumulate them into the empty DataFrame. Are there any other ideas? Thanks.
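
The plan described above could be sketched roughly as follows. This is only an illustration under stated assumptions: the miniature df_in, its column names (Countries1, Amount1, ...) and its values are hypothetical stand-ins for the real file, and the columns are assumed to come in adjacent (country, amount) pairs.

```python
import pandas as pd

# Hypothetical miniature of the input: (country, amount) column pairs,
# padded with missing values where a measure covers fewer countries.
df_in = pd.DataFrame({
    "Countries1": ["Sweden", "France", None],
    "Amount1": [5.0, 3.0, None],
    "Countries2": ["Spain", "Sweden", "Norway"],
    "Amount2": [3.0, 1.0, 2.0],
})

# Unique country names across all 'Countries' columns, missing values dropped.
country_cols = [c for c in df_in.columns if c.startswith("Countries")]
countries = pd.unique(df_in[country_cols].values.ravel())
countries = sorted(c for c in countries if pd.notna(c))

# Empty frame indexed by country, filled one (country, amount) pair at a time.
df_out = pd.DataFrame(index=countries)
for i in range(0, df_in.shape[1], 2):
    pair = df_in.iloc[:, i:i + 2].dropna()
    country_col, amount_col = pair.columns
    # Column assignment aligns on the index, so absent countries become NaN.
    df_out[amount_col] = pair.set_index(country_col)[amount_col]
```

Because assignment aligns on the country index, no explicit lookup per country is needed.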

1 Answer:

Answer 0: (score: 0)

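First, select the columns by position with .iloc (older pandas versions used .ix for this):

df_in = pd.read_csv('https://dl.dropboxusercontent.com/u/40513206/input.csv', sep=';', header=0, index_col=None,
                    na_values=[''], mangle_dupe_cols=False)

# Each slice keeps one (country, amount) pair, drops rows with missing data, and indexes by country
df1 = df_in.iloc[:, :2].dropna().set_index('Countries1')
df2 = df_in.iloc[:, 2:4].dropna().set_index('Countries2')
df3 = df_in.iloc[:, 4:].dropna().set_index('Countries3')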

Then concatenate along axis 1:

pd.concat([df1,df2,df3], axis=1)


               Amount  Amount  Amount
Austria           NaN       5     NaN
Denmark             6     NaN     NaN
France              3     NaN     NaN
Ireland           NaN     NaN       6
Norway            NaN       2     NaN
Russia            NaN     NaN       5
Slovenia          NaN     NaN       4
Spain             NaN       3       3
Sweden              5       1       2
Switzerland         4       4     NaN
U.K.                1     NaN     NaN
United States       2     NaN       1
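
If the file has more than three pairs, the same idea can be written as a loop over column positions. The following is a minimal sketch, not a tested solution for the actual file: the small df_in and the names in measures are hypothetical (the question does not say which Amount column holds which measure), and the columns are assumed to alternate strictly between country and amount.

```python
import pandas as pd

# Hypothetical input with the same repeating layout; the header repeats
# 'Amount', so the pairs are addressed by position rather than by name.
df_in = pd.DataFrame(
    [["Sweden", 5.0, "Spain", 3.0],
     ["France", 3.0, "Sweden", 1.0],
     [None, None, "Norway", 2.0]],
    columns=["Countries1", "Amount", "Countries2", "Amount"],
)

# Hypothetical descriptive names, one per (country, amount) pair.
measures = ["suicide_rate", "alcohol_consumption"]

pieces = []
for k, name in enumerate(measures):
    # Take the k-th pair of columns and drop the padding rows.
    pair = df_in.iloc[:, 2 * k:2 * k + 2].dropna()
    pair.columns = ["Country", name]
    pieces.append(pair.set_index("Country"))

# Outer join on the country index; missing combinations become NaN.
df_out = pd.concat(pieces, axis=1).sort_index()
```

Renaming each Amount column before concatenating also avoids ending up with several columns that share the same name.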