What is the best way in pandas to chunk data into specific time windows for analysis?
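I have a dataset where each row represents 1 second, and I want to find the highest average of a value over whatever interval length I pass in.
I tried resampling, but it does not use every second as a starting point. For example, if I have 2 minutes of data and I am comparing 30-second intervals, I want to compare 0-30 seconds, 1-31, 2-32, and so on. With resampling I only get 0-30, 30-60, 60-90, and 90-120.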
# 1 second intervals from 0-60 seconds
interval_lengths = [i for i in range(1, 61)]
# 15 second intervals from 1:15 - 5:00 mins
interval_lengths += [i for i in range(75, 301, 15)]
# 30 second intervals for everything after 5 mins
interval_lengths += [i for i in range(330, df_samples['ride_length'].max() + 1, 30)]
# Rows of the most recent workout (the workout on the last index label)
latest_df = df_samples[df_samples['workoutId'] == df_samples.loc[df_samples.index.max()]['workoutId']]
best_rows = []
latest_rows = []

# Resample by interval and get the max power for each interval
for i in interval_lengths:
    resample_chunk = str(i) + 'S'

    # Get intervals across all workouts
    best_samplechunks = df_samples.groupby(['workoutId']).resample(resample_chunk).mean().reset_index()
    best_samplechunks['interval'] = resample_chunk[:-1]
    # Keep the row with the max power for this interval
    best_rows.append(best_samplechunks.loc[best_samplechunks['power'].idxmax()])

    # Get intervals for the latest workout
    latest_samplechunks = latest_df.groupby(['workoutId']).resample(resample_chunk).mean().reset_index()
    latest_samplechunks['interval'] = resample_chunk[:-1]
    # Keep the row with the max power for this interval
    latest_rows.append(latest_samplechunks.loc[latest_samplechunks['power'].idxmax()])

# Build each frame once from the collected rows (works on pandas 2.x, where DataFrame.append no longer exists)
best_interval_df = pd.DataFrame(best_rows)
latest_interval_df = pd.DataFrame(latest_rows)
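To make the difference between the two kinds of window concrete, here is a minimal sketch on made-up data. The six power values and the timestamps are invented for illustration and are not taken from my dataset. It only shows that `resample()` produces back-to-back bins while `rolling()` moves forward one row at a time, which is the behaviour I am after.

```python
import pandas as pd

# Hypothetical toy data: one power reading per second for 6 seconds.
power = pd.Series(
    [100.0, 200.0, 300.0, 400.0, 300.0, 200.0],
    index=pd.date_range("2020-01-01", periods=6, freq="s"),
)

# resample() cuts the series into back-to-back 3-second bins: 0-3 and 3-6.
tumbling = power.resample("3s").mean()

# rolling() slides a 3-row window forward one row at a time: 0-3, 1-4, 2-5, 3-6.
# min_periods=3 leaves the first two (incomplete) windows as NaN, which dropna() removes.
sliding = power.rolling(3, min_periods=3).mean().dropna()

print(len(tumbling), "resampled bins, best mean:", round(tumbling.max(), 1))
print(len(sliding), "sliding windows, best mean:", round(sliding.max(), 1))
```

On this toy series the best 3-second average sits in the 2-5 window (300, 400, 300), which the resampled bins never look at.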
Update
Here is a link to the data: https://www.dropbox.com/s/f8vd8lducriki5l/sample.csv?dl=0
I also tried setting this up with rolling(), but I don't think I am getting the right results:
import pandas as pd

df_samples = pd.read_csv('sample.csv')
# 1 second intervals from 0-60 seconds
interval_lengths = [i for i in range(1, 61)]
# 15 second intervals from 1:15 - 5:00 mins
interval_lengths += [i for i in range(75, 301, 15)]
# 30 second intervals for everything after 5 mins
interval_lengths += [i for i in range(330, df_samples['ride_length'].max() + 1, 30)]
intervals = df_samples
intervals['power'] = intervals['power'].interpolate()
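# Rows of the most recent workout; copy so that adding columns later does not write into a slice
latest_df = intervals[intervals['workoutId'] == intervals.loc[intervals.index.max()]['workoutId']].copy()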
best_rows = []
latest_rows = []

for i in interval_lengths:
    # Get intervals across all workouts
    temp_df = intervals
    temp_df['best_power'] = intervals.groupby(['workoutId'])['power'].rolling(int(i), min_periods=1).mean().reset_index(0, drop=True)
    temp_df['interval'] = i
    best_rows.append(temp_df.loc[temp_df['best_power'].idxmax()])

    # Get intervals for the latest workout
    latest_temp_df = latest_df
    latest_temp_df['best_power'] = latest_df.groupby(['workoutId'])['power'].rolling(int(i), min_periods=1).mean().reset_index(0, drop=True)
    latest_temp_df['interval'] = i
    latest_rows.append(latest_temp_df.loc[latest_temp_df['best_power'].idxmax()])

# Build each frame once from the collected rows (works on pandas 2.x, where DataFrame.append no longer exists)
best_interval_df = pd.DataFrame(best_rows).set_index('interval')
latest_interval_df = pd.DataFrame(latest_rows).set_index('interval')
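One thing that may explain the odd results: with `min_periods=1`, the first rows of each workout are averaged over fewer than `i` samples, so a single high reading near the start can win for every interval length. Below is a minimal sketch of the same idea with `min_periods` set to the window size, so only complete windows count. The function name `best_power_curve`, the `end_row` column, and the toy data are hypothetical, and it assumes a unique row index and one row per second with no gaps.

```python
import pandas as pd


def best_power_curve(df, windows, value_col="power", group_col="workoutId"):
    """For each window length, find the highest full-window rolling mean.

    Hypothetical helper: assumes one row per second and a unique index.
    """
    rows = []
    for window in windows:
        # Trailing mean per workout; NaN until a full window is available.
        rolled = (
            df.groupby(group_col)[value_col]
            .rolling(window, min_periods=window)
            .mean()
            .reset_index(level=0, drop=True)  # back to the original row labels
        )
        if not rolled.notna().any():
            continue  # no workout is long enough for this window
        end = rolled.idxmax()  # label of the last row in the best window
        rows.append({
            "interval": window,
            group_col: df.loc[end, group_col],
            "best_power": rolled.loc[end],
            "end_row": end,
        })
    columns = ["interval", group_col, "best_power", "end_row"]
    return pd.DataFrame(rows, columns=columns).set_index("interval")


# Made-up data: workout "b" opens with one 500 W spike and is only 4 rows long.
toy = pd.DataFrame({
    "workoutId": ["a"] * 5 + ["b"] * 4,
    "power": [100.0, 200.0, 300.0, 400.0, 250.0, 500.0, 100.0, 100.0, 100.0],
})
curve = best_power_curve(toy, [1, 2, 3, 5, 6])
print(curve)
```

In this toy data the 500 W spike wins only for the 1-second interval; with `min_periods=1` it would have been reported as the best average for every interval length. The 6-second interval is skipped because neither workout is that long.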