Manipulating a df with multiple time series into an array (filling in missing dates)

Time: 2017-05-15 14:01:57

Tags: python arrays pandas numpy

I have a relatively large df (10^6 records) structured as follows:

Date,SN,Zip Code,A,B,Total,Lat,Lon
2015-09-01,10948.0,80015,0,0,1,39.626999999999995,-104.779
2015-09-01,11906.0,85392,0,0,1,33.478,-112.309
2015-09-03,10948.0,85260,0,0,1,33.611,-111.891
2015-09-03,11906.0,85050,0,0,1,33.683,-111.99799999999999
2015-09-05,12111.0,23834,0,0,1,37.291,-77.404
2015-09-05,11906.0,72761,0,0,1,36.169000000000004,-94.455

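Note that each SN (a unique identifier) has at most one record per day. On some days, some SNs have no record, which means Total is 0 for that SN on that day. I want to convert this df into a numpy array that has one row per day and one column per SN, holding the Total, with the missing dates filled in with 0.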

1 answer:

Answer 0 (score: 1)

You need pivot:

df.pivot(index='Date', columns='SN', values='Total').fillna(0)

#SN         10948.0 11906.0 12111.0
#Date           
#2015-09-01     1.0     1.0     0.0
#2015-09-03     1.0     1.0     0.0
#2015-09-05     0.0     1.0     1.0

To get the numpy array:

df.pivot(index='Date', columns='SN', values='Total').fillna(0).values
#array([[ 1.,  1.,  0.],
#       [ 1.,  1.,  0.],
#       [ 0.,  1.,  1.]])
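
For anyone who wants to run the snippets above, here is a minimal self-contained sketch that rebuilds the sample rows from the question and pivots them. Loading the data through io.StringIO and the names csv_text, wide and arr are my own assumptions for illustration, not part of the original answer. Keyword arguments are used for pivot because newer pandas versions no longer accept them positionally.

```python
import io

import pandas as pd

# Sample rows copied from the question (StringIO is used only for illustration;
# the real data would come from wherever the df is actually loaded)
csv_text = """Date,SN,Zip Code,A,B,Total,Lat,Lon
2015-09-01,10948.0,80015,0,0,1,39.626999999999995,-104.779
2015-09-01,11906.0,85392,0,0,1,33.478,-112.309
2015-09-03,10948.0,85260,0,0,1,33.611,-111.891
2015-09-03,11906.0,85050,0,0,1,33.683,-111.99799999999999
2015-09-05,12111.0,23834,0,0,1,37.291,-77.404
2015-09-05,11906.0,72761,0,0,1,36.169000000000004,-94.455
"""

df = pd.read_csv(io.StringIO(csv_text))

# One row per Date, one column per SN; (Date, SN) pairs with no record
# come out as NaN and are then replaced with 0
wide = df.pivot(index="Date", columns="SN", values="Total").fillna(0)

# Plain 2-D numpy array: rows follow wide.index, columns follow wide.columns
arr = wide.to_numpy()
print(arr.shape)  # (3, 3)
```

Note that pivot raises an error if a (Date, SN) pair appears more than once, which should not happen here given the one-record-per-day guarantee stated in the question.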

Update: to get all dates, you can use reindex:

import pandas as pd

# convert Date column to datetime
df['Date'] = pd.to_datetime(df['Date'])

# pivot to wide format
df1 = df.pivot(index='Date', columns='SN', values='Total').fillna(0)

# reindex to get all dates
df1.reindex(pd.date_range(df1.index.min(), df1.index.max())).fillna(0)

#        SN 10948.0 11906.0 12111.0
#2015-09-01     1.0     1.0     0.0
#2015-09-02     0.0     0.0     0.0
#2015-09-03     1.0     1.0     0.0
#2015-09-04     0.0     0.0     0.0
#2015-09-05     0.0     1.0     1.0
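
To go from the reindexed frame to the array asked for in the question, one possible end-to-end sketch is below. It assumes the same column names as the question and uses a small hand-built frame as a stand-in for the real data. The names full_range, wide and arr are illustrative only.

```python
import pandas as pd

# Small stand-in for the real data (only the columns the pivot needs)
df = pd.DataFrame({
    "Date": ["2015-09-01", "2015-09-01", "2015-09-03",
             "2015-09-03", "2015-09-05", "2015-09-05"],
    "SN": [10948.0, 11906.0, 10948.0, 11906.0, 12111.0, 11906.0],
    "Total": [1, 1, 1, 1, 1, 1],
})
df["Date"] = pd.to_datetime(df["Date"])

# Wide format: one row per observed date, one column per SN
wide = df.pivot(index="Date", columns="SN", values="Total")

# Every calendar day between the first and last observed date
full_range = pd.date_range(wide.index.min(), wide.index.max(), freq="D")

# Add rows for the missing days, then replace every gap with 0
wide = wide.reindex(full_range).fillna(0)

# Final array: rows are days in full_range, columns are SNs in wide.columns
arr = wide.to_numpy()
print(arr.shape)  # (5, 3)
```

Row i of arr corresponds to full_range[i] and column j to wide.columns[j], so it is worth keeping those two around if array positions need to be mapped back to dates and SNs.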