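I need to log records (sensor data) in a pandas.DataFrame, but I only need to keep the records from the last 24 hours. A new record arrives every second.
The records have the format:
{'Date': ..., 'Sensor1': 10, 'Sensor2': 12, ...}
where 'Date' should also be the index of the DataFrame.
Of course, it is possible to use:
df = df.append( newRecord )
df = df.drop( df[df.Date < datetime.now() - timedelta( hours=24 )].index )
But I find this ugly.
What is the most efficient and most idiomatic pandas way to do this?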
Answer 0 (score: 2)
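I think you can use a subset with boolean indexing to remove the rows, but that is not the fastest method. Instead, you can set the column Date as the index and then slice the DataFrame from the cutoff time end onward.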
import pandas as pd
import datetime
#create testing DataFrame
def format_time():
    #current time, truncated to whole seconds
    t = datetime.datetime.now()
    s = t.strftime('%Y-%m-%d %H:%M:%S')
    return pd.to_datetime(s)
start = format_time()
print(start)
2016-03-13 09:12:44
N = 85000
df = pd.DataFrame({'Date': pd.Series(pd.date_range(start - pd.Timedelta(days=1, minutes=20) , periods=N, freq='s')), 'a': range(N)})
print(df.head())
                 Date  a
0 2016-03-12 08:52:44  0
1 2016-03-12 08:52:45  1
2 2016-03-12 08:52:46  2
3 2016-03-12 08:52:47  3
4 2016-03-12 08:52:48  4
#set index from column Date
df = df.set_index('Date')
#print df
#find chopping time
end = start - pd.Timedelta(days=1)
print(end)
2016-03-12 09:12:44
#boolean indexing
df1 = df[(df.index >= end ) & (df.index <= start)]
#chopping method
df2 = df[end:]
#test equality
print(df1.equals(df2))
True
Timings:
In [87]: %timeit df[(df.index >= end ) & (df.index <= start)]
The slowest run took 4.01 times longer than the fastest. This could mean that an intermediate result is being cached
1000 loops, best of 3: 1.75 ms per loop
In [88]: %timeit df[end:]
The slowest run took 6.84 times longer than the fastest. This could mean that an intermediate result is being cached
10000 loops, best of 3: 120 µs per loop
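Applied to the original use case, the slicing approach could look roughly like the following. This is only a sketch: the helper name `add_record` and its `window` parameter are made up for illustration, it uses `pd.concat` in place of `DataFrame.append` (which was removed in pandas 2.0), and it assumes records arrive in time order so that the index stays sorted.

```python
import pandas as pd

def add_record(df, record, window=pd.Timedelta(hours=24)):
    """Hypothetical helper: append one record, then keep only the last `window`."""
    ts = pd.Timestamp(record['Date'])
    values = {k: v for k, v in record.items() if k != 'Date'}
    # one-row frame indexed by the record's timestamp
    row = pd.DataFrame(values, index=pd.DatetimeIndex([ts], name='Date'))
    df = pd.concat([df, row])
    # label-based slice on the sorted DatetimeIndex: keeps rows at or after the cutoff
    return df.loc[ts - window:]

# small frame standing in for the existing log
idx = pd.to_datetime(['2016-03-12 09:00:00',
                      '2016-03-12 10:00:00',
                      '2016-03-12 11:00:00'])
df = pd.DataFrame({'Sensor1': [1, 2, 3], 'Sensor2': [4, 5, 6]},
                  index=pd.DatetimeIndex(idx, name='Date'))

new_record = {'Date': '2016-03-13 09:30:00', 'Sensor1': 10, 'Sensor2': 12}
out = add_record(df, new_record)
print(out)
```

The 09:00:00 row is more than 24 hours older than the new record, so it falls outside the slice. Note that this still rebuilds the frame on every call, so it addresses readability more than per-record cost.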
Answer 1 (score: 1)
Reorganizing the whole DataFrame every second is an expensive operation:
In [6]: %timeit df.drop(4)
10 loops, best of 3: 17.3 ms per loop
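Here this can be avoided by storing the sensor data in a fixed-size rolling buffer. The index is just an integer, the second of the day, so one day takes 24*3600 rows.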
import pandas as pd
from numpy.random import rand
aday=24*3600  #number of seconds in one day
date=pd.date_range('00:00:00', periods=aday, freq='s')
df=pd.DataFrame({'Date':date,'Sensor1':rand(aday),'Sensor2':rand(aday)})
Adding a sample this way is very fast:
sample={'Date': pd.Timestamp('2016-12-04 12:00:00'), 'Sensor1': .1, 'Sensor2': .2}
def indexer(t):
    #second of the day, used as the row position in the buffer
    return t.hour*3600+t.minute*60+t.second
def set_sample(df,sample):
    #overwrite, in place, the row that belongs to this time of day
    date=sample['Date']
    index=indexer(date)
    df.iat[index,0]=date
    df.iat[index,1]=sample['Sensor1']
    df.iat[index,2]=sample['Sensor2']
In [7]: %timeit set_sample(df,sample)
1000 loops, best of 3: 141 µs per loop
To dump the current last 24 hours, just do:
dfnow=df.set_index(df['Date']).sort_index().copy()
The time is now the index.
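To see why the buffer always holds the last 24 hours, it may help to note that a sample taken at a given time of day overwrites the row written exactly one day earlier. The following self-contained sketch illustrates this; the names `slot`, `write` and `buf` are made up for the example, and zeros stand in for real sensor readings.

```python
import numpy as np
import pandas as pd

SLOTS = 24 * 3600  # one row per second of the day

def slot(t):
    """Map a timestamp to its second-of-day row position."""
    return t.hour * 3600 + t.minute * 60 + t.second

def write(buf, sample):
    """Overwrite, in place, the row for the sample's time of day."""
    i = slot(sample['Date'])
    buf.iat[i, 0] = sample['Date']
    buf.iat[i, 1] = sample['Sensor1']
    buf.iat[i, 2] = sample['Sensor2']

# buffer pre-filled with one full day (2016-12-03) of placeholder readings
buf = pd.DataFrame({
    'Date': pd.date_range('2016-12-03', periods=SLOTS, freq='s'),
    'Sensor1': np.zeros(SLOTS),
    'Sensor2': np.zeros(SLOTS),
})

# a sample from noon on the next day replaces the row from noon the day before
t = pd.Timestamp('2016-12-04 12:00:00')
write(buf, {'Date': t, 'Sensor1': .1, 'Sensor2': .2})
i = slot(t)

# chronological view of the buffer: the new sample is now the latest row
last_day = buf.set_index('Date').sort_index()
print(last_day.tail(3))
```

The buffer never grows or shrinks, so the per-sample cost does not depend on how much history is kept; the price is that the rows are only in chronological order after the final `sort_index`.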