Add a new column to a DataFrame (= index.diff())

Time: 2017-09-21 01:48:34

Tags: python pandas datetime dataframe

I have a DataFrame:

>>> d.head()
Out[11]: 
                                      SOURCE
Time                                        
2017-04-03 09:05:07+08:00                 g
2017-04-03 09:05:09.744000+08:00          h
2017-04-03 09:05:17.168000+08:00          h
2017-04-03 09:05:27.118000+08:00          f
2017-04-03 09:05:55.616000+08:00          r

>>> d.index
Out[17]: 
DatetimeIndex([ '2017-04-03 09:05:07+08:00', '2017-04-03 09:05:09.744000+08:00',...'2017-06-20 04:58:49.685000+08:00'], dtype='datetime64[ns]', name=u'Time', length=783743, freq=None, tz='Asia/Singapore')

I want to add a new column equal to the time difference between consecutive readings. I tried the following, but none of them worked:

1

d['timediff']= d.index.diff()

2

temp = pd.DataFrame(d.index)
d['timediff']= temp.diff().iloc[:,0]

3

temp = pd.DataFrame(d.index)
d['timediff']=  pd.Series(temp.diff().iloc[:,0], index=d.index)

4

temp = pd.DataFrame(d.index)
d.assign(td=temp.diff())

All of these result in NaNs in the 'timediff' column.

Finally, this works:

temp = pd.DataFrame(d.index)
temp = temp.diff().iloc[:,0].values
d = d.assign(timediff = temp)

Can someone clarify what is happening here? For reference, this is what I get from temp.diff():

>>> temp.diff().iloc[0:5,0]
Out[13]: 
0                       NaN
1    0 days 00:00:02.744000
2    0 days 00:00:07.424000
3    0 days 00:00:09.950000
4    0 days 00:00:28.498000
Name: Time, dtype: object

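I also have another (minor) question: the index of d reads like '2017-04-03 09:05:09.744000+08:00'. This started after I converted the index's time zone. Any idea what the +08:00 in each index value refers to?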

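2 Answers:

Answer 0: (score: 1)

I think you first need to convert the index with to_series, because index.diff() is not implemented yet.

The new Series also needs the original index; otherwise you get NaT values: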

print (d.index.to_series())
Time
2017-04-03 09:05:07+08:00          2017-04-03 01:05:07.000
2017-04-03 09:05:09.744000+08:00   2017-04-03 01:05:09.744
2017-04-03 09:05:17.168000+08:00   2017-04-03 01:05:17.168
2017-04-03 09:05:27.118000+08:00   2017-04-03 01:05:27.118
2017-04-03 09:05:55.616000+08:00   2017-04-03 01:05:55.616
Name: Time, dtype: datetime64[ns]

d['diff'] = d.index.to_series().diff()
print (d)
                                 SOURCE            diff
Time                                                   
2017-04-03 09:05:07+08:00             g             NaT
2017-04-03 09:05:09.744000+08:00      h 00:00:02.744000
2017-04-03 09:05:17.168000+08:00      h 00:00:07.424000
2017-04-03 09:05:27.118000+08:00      f 00:00:09.950000
2017-04-03 09:05:55.616000+08:00      r 00:00:28.498000
print (pd.Series(d.index))
0          2017-04-03 09:05:07+08:00
1   2017-04-03 09:05:09.744000+08:00
2   2017-04-03 09:05:17.168000+08:00
3   2017-04-03 09:05:27.118000+08:00
4   2017-04-03 09:05:55.616000+08:00
Name: Time, dtype: datetime64[ns, Asia/Singapore]

d['diff'] = pd.Series(d.index).diff()
print (d)
                                 SOURCE diff
Time                                        
2017-04-03 09:05:07+08:00             g  NaT
2017-04-03 09:05:09.744000+08:00      h  NaT
2017-04-03 09:05:17.168000+08:00      h  NaT
2017-04-03 09:05:27.118000+08:00      f  NaT
2017-04-03 09:05:55.616000+08:00      r  NaT
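The all-NaT column above is a result of index alignment: when a Series is assigned to a column, pandas matches rows by label, not by position. The following is a minimal sketch on a made-up three-row frame (the column names diff_misaligned, diff_aligned and diff_by_position are only for illustration). It also shows why the .values approach from the question works.

```python
import pandas as pd

# Hypothetical three-row frame that mimics the question's data.
idx = pd.DatetimeIndex(
    ["2017-04-03 09:05:07.000", "2017-04-03 09:05:09.744", "2017-04-03 09:05:17.168"],
    tz="Asia/Singapore",
    name="Time",
)
d = pd.DataFrame({"SOURCE": ["g", "h", "h"]}, index=idx)

# pd.Series(d.index) is labelled 0, 1, 2, so none of its labels match d's
# timestamps and alignment fills the whole column with NaT.
misaligned = pd.Series(d.index).diff()
d["diff_misaligned"] = misaligned

# to_series() reuses the index values as labels, so rows line up.
d["diff_aligned"] = d.index.to_series().diff()

# A bare NumPy array has no labels, so it is assigned by position.
d["diff_by_position"] = misaligned.values

print(d)
```

Either the aligned Series or the plain array gives the expected differences; the Series labelled 0, 1, 2 does not.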

Converting to a DataFrame likewise requires assigning the index and then selecting the column as a Series:

d['diff'] = pd.DataFrame(d.index, index=d.index)['Time'].diff()
print (d)
                                 SOURCE            diff
Time                                                   
2017-04-03 09:05:07+08:00             g             NaT
2017-04-03 09:05:09.744000+08:00      h 00:00:02.744000
2017-04-03 09:05:17.168000+08:00      h 00:00:07.424000
2017-04-03 09:05:27.118000+08:00      f 00:00:09.950000
2017-04-03 09:05:55.616000+08:00      r 00:00:28.498000
d['diff'] = pd.DataFrame(d.index, index=d.index).iloc[:, 0].diff()
print (d)
                                 SOURCE            diff
Time                                                   
2017-04-03 09:05:07+08:00             g             NaT
2017-04-03 09:05:09.744000+08:00      h 00:00:02.744000
2017-04-03 09:05:17.168000+08:00      h 00:00:07.424000
2017-04-03 09:05:27.118000+08:00      f 00:00:09.950000
2017-04-03 09:05:55.616000+08:00      r 00:00:28.498000

The latest version of pandas handles time zones well. To convert the index to UTC, use DatetimeIndex.tz_convert or DataFrame.tz_convert:

d.index = d.index.tz_convert('UTC')
print (d)
                                 SOURCE
Time                                   
2017-04-03 01:05:07+00:00             g
2017-04-03 01:05:09.744000+00:00      h
2017-04-03 01:05:17.168000+00:00      h
2017-04-03 01:05:27.118000+00:00      f
2017-04-03 01:05:55.616000+00:00      r
d = d.tz_convert('UTC')
print (d)
                                 SOURCE
Time                                   
2017-04-03 01:05:07+00:00             g
2017-04-03 01:05:09.744000+00:00      h
2017-04-03 01:05:17.168000+00:00      h
2017-04-03 01:05:27.118000+00:00      f
2017-04-03 01:05:55.616000+00:00      r
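On the minor question: the +08:00 suffix is the offset of the displayed time from UTC. A small sketch, using a made-up two-row index, suggests that tz_convert only changes how the instants are displayed, so the differences between readings stay the same:

```python
import pandas as pd

# Hypothetical two-row index in Singapore time (UTC+8).
sgt = pd.DatetimeIndex(
    ["2017-04-03 09:05:07.000", "2017-04-03 09:05:09.744"],
    tz="Asia/Singapore",
)
utc = sgt.tz_convert("UTC")

# The suffix shown on each value is that time zone's offset from UTC.
offset = sgt[0].utcoffset()

# Both indexes hold the same instants; only the wall-clock display differs.
same_instants = (sgt == utc).all()

# Differences between consecutive readings do not depend on the time zone.
diff_sgt = sgt.to_series().diff()
diff_utc = utc.to_series().diff()

print(offset, same_instants)
print(diff_sgt.iloc[1], diff_utc.iloc[1])
```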

To remove the time zone from the DatetimeIndex:

d = d.tz_convert('UTC').tz_localize(None)
print (d)
                        SOURCE
Time                          
2017-04-03 01:05:07.000      g
2017-04-03 01:05:09.744      h
2017-04-03 01:05:17.168      h
2017-04-03 01:05:27.118      f
2017-04-03 01:05:55.616      r

But be careful: if you only call tz_localize(None), it just drops the +08:00 and you get different times:

d = d.tz_localize(None)
print (d)
                        SOURCE
Time                          
2017-04-03 09:05:07.000      g
2017-04-03 09:05:09.744      h
2017-04-03 09:05:17.168      h
2017-04-03 09:05:27.118      f
2017-04-03 09:05:55.616      r

See the difference:

d = d.tz_convert('UTC').tz_localize(None).tz_localize('UTC').tz_convert('Asia/Singapore')
print (d)
                                 SOURCE
Time                                   
2017-04-03 09:05:07+08:00             g
2017-04-03 09:05:09.744000+08:00      h
2017-04-03 09:05:17.168000+08:00      h
2017-04-03 09:05:27.118000+08:00      f
2017-04-03 09:05:55.616000+08:00      r

VS

d = d.tz_localize(None).tz_localize('UTC').tz_convert('Asia/Singapore')
print (d)
                                 SOURCE
Time                                   
2017-04-03 17:05:07+08:00             g
2017-04-03 17:05:09.744000+08:00      h
2017-04-03 17:05:17.168000+08:00      h
2017-04-03 17:05:27.118000+08:00      f
2017-04-03 17:05:55.616000+08:00      r
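The same comparison can be condensed to a single timestamp. This is only a sketch with one made-up value, and the variable names are illustrative:

```python
import pandas as pd

ts = pd.Timestamp("2017-04-03 09:05:07", tz="Asia/Singapore")
idx = pd.DatetimeIndex([ts])

# Convert first, then drop the zone: naive values are UTC wall-clock times.
naive_utc = idx.tz_convert("UTC").tz_localize(None)

# Drop the zone directly: naive values keep the local wall-clock times.
naive_local = idx.tz_localize(None)

# Relabelling the local values as UTC moves the instant by the 8-hour offset.
shifted = naive_local.tz_localize("UTC").tz_convert("Asia/Singapore")

print(naive_utc[0], naive_local[0], shifted[0])
```

So tz_localize(None) on its own is safe only if the naive values are meant to be read as local time afterwards.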

Answer 1: (score: 0)

>>> df
                                 SOURCE
Time                                   
2017-04-03 09:05:07+08:00             g
2017-04-03 09:05:09.744000+08:00      h
2017-04-03 09:05:17.168000+08:00      h
2017-04-03 09:05:27.118000+08:00      f
2017-04-03 09:05:55.616000+08:00      r

Since you cannot use .diff() on an index, first convert it to a column:

>>> df['Time'] = df.index
>>> df
                                 SOURCE                             Time
Time                                                                    
2017-04-03 09:05:07+08:00             g        2017-04-03 09:05:07+08:00
2017-04-03 09:05:09.744000+08:00      h 2017-04-03 09:05:09.744000+08:00
2017-04-03 09:05:17.168000+08:00      h 2017-04-03 09:05:17.168000+08:00
2017-04-03 09:05:27.118000+08:00      f 2017-04-03 09:05:27.118000+08:00
2017-04-03 09:05:55.616000+08:00      r 2017-04-03 09:05:55.616000+08:00

Then it works well:

>>> df['Time'].diff()

Time
2017-04-03 09:05:07+08:00                      NaT
2017-04-03 09:05:09.744000+08:00   00:00:02.744000
2017-04-03 09:05:17.168000+08:00   00:00:07.424000
2017-04-03 09:05:27.118000+08:00   00:00:09.950000
2017-04-03 09:05:55.616000+08:00   00:00:28.498000
Name: Time, dtype: timedelta64[ns]
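To store the result as a column, the diff can be assigned straight back, since it carries the same index as df. A minimal sketch on a made-up two-row frame follows; the column name timediff is taken from the question, and dropping the helper column afterwards is optional:

```python
import pandas as pd

# Hypothetical two-row frame that mimics the question's data.
idx = pd.DatetimeIndex(
    ["2017-04-03 09:05:07.000", "2017-04-03 09:05:09.744"],
    tz="Asia/Singapore",
    name="Time",
)
df = pd.DataFrame({"SOURCE": ["g", "h"]}, index=idx)

# Copy the index into a helper column; it carries the same row labels.
df["Time"] = df.index

# The diff shares df's index, so it aligns row by row on assignment.
df["timediff"] = df["Time"].diff()

# Drop the helper column once it is no longer needed.
df = df.drop(columns="Time")

print(df)
```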