Pandas Optimization - 如何在两列中处理Pandas时间序列数据并在列之间获取每小时数据

时间:2018-06-14 17:49:12

标签: python pandas time-series

我正在尝试获取员工的考勤卡数据,每一行都有冲头start_time和end_time,介于两者之间的时间可以介于0分钟到9小时之间。我希望每行能获得员工每小时工作的时间。 我可以通过以下方式完成:

import pandas as pd 
import numpy as np
pd.options.display.max_columns = 100

store_id = np.repeat(1,10)
employee = [1,2,3,1,2,3,1,2,3,4]
start_time = pd.date_range('2015-07-03', periods=10, freq='25T')
end_time = pd.date_range('2015-07-03', periods = 10,freq = '40T')
df = pd.DataFrame({'store_id':store_id,'employee':employee,'start_time':start_time,'end_time':end_time})
df.head()
    employee    end_time    start_time  store_id
0   1   2015-07-03 00:00:00 2015-07-03 00:00:00 1
1   2   2015-07-03 00:40:00 2015-07-03 00:25:00 1
2   3   2015-07-03 01:20:00 2015-07-03 00:50:00 1
3   1   2015-07-03 02:00:00 2015-07-03 01:15:00 1
4   2   2015-07-03 02:40:00 2015-07-03 01:40:00 1

df['date']=df['start_time'].dt.date

def shift_time_in_hr(row):
    #hrs- Dictionary assigning each hour as a key to the time worked in that particular hour
    hrs = dict(zip(np.arange(0,25),[pd.Timedelta(0)]*24))
    #Case 1: if the start time and end time in the same hour then assign minutes to the start hour
    if row['start_time'].hour == row['end_time'].hour:
        hrs[row['start_time'].hour]= row['end_time']-row['start_time']        
        return row.append(pd.Series(list(hrs.values()),index = ['{}_hr'.format(i) for i in list(hrs.keys())]))
    else:
        hrs_worked = np.arange(row['start_time'].hour,row['end_time'].hour+1)
        #Case 2: If the start_time and end_time are in different Hours and if there are more hours in between assign them with 60 minutes
        if len(hrs_worked)>2:
            for i in range(hrs_worked[0]+1,hrs_worked[-1]):
                hrs[i] = pd.Timedelta('1 Hour')
        #Assign start_time and end_time minutes to respective hours
        hrs[hrs_worked[0]] =  pd.Timedelta('{} Minutes'.format(60-row['start_time'].minute))
        #hrs[hrs_worked[0]] =  60-row['start_time'].minute
        hrs[hrs_worked[-1]]= pd.Timedelta('{} Minutes'.format(row['end_time'].minute))
        return row.append(pd.Series(list(hrs.values()),index = ['{}_hr'.format(i) for i in list(hrs.keys())]))

df=df.apply(shift_time_in_hr,axis = 1)

df.head()
employee    end_time    start_time  store_id    date    0_hr    1_hr    2_hr    3_hr    4_hr    5_hr    6_hr    7_hr    8_hr    9_hr    10_hr   11_hr   12_hr   13_hr   14_hr   15_hr   16_hr   17_hr   18_hr   19_hr   20_hr   21_hr   22_hr   23_hr
0   1   2015-07-03 00:00:00 2015-07-03 00:00:00 1   2015-07-03  00:00:00    00:00:00    00:00:00    0 days  0 days  0 days  0 days  0 days  0 days  0 days  0 days  0 days  0 days  0 days  0 days  0 days  0 days  0 days  0 days  0 days  0 days  0 days  0 days  0 days
1   2   2015-07-03 00:40:00 2015-07-03 00:25:00 1   2015-07-03  00:15:00    00:00:00    00:00:00    0 days  0 days  0 days  0 days  0 days  0 days  0 days  0 days  0 days  0 days  0 days  0 days  0 days  0 days  0 days  0 days  0 days  0 days  0 days  0 days  0 days
2   3   2015-07-03 01:20:00 2015-07-03 00:50:00 1   2015-07-03  00:10:00    00:20:00    00:00:00    0 days  0 days  0 days  0 days  0 days  0 days  0 days  0 days  0 days  0 days  0 days  0 days  0 days  0 days  0 days  0 days  0 days  0 days  0 days  0 days  0 days
3   1   2015-07-03 02:00:00 2015-07-03 01:15:00 1   2015-07-03  00:00:00    00:45:00    00:00:00    0 days  0 days  0 days  0 days  0 days  0 days  0 days  0 days  0 days  0 days  0 days  0 days  0 days  0 days  0 days  0 days  0 days  0 days  0 days  0 days  0 days
4   2   2015-07-03 02:40:00 2015-07-03 01:40:00 1   2015-07-03  00:00:00    00:20:00    00:40:00    0 days  0 days  0 days  0 days  0 days  0 days  0 days  0 days  0 days  0 days  0 days  0 days  0 days  0 days  0 days  0 days  0 days  0 days  0 days  0 days  0 days

dict_agg= dict(zip(df.columns[5:],[np.sum]*24))
df.groupby(['store_id','employee','date']).agg(dict_agg)

预期产出: 在一天中,员工每小时工作的分钟数。

                                0_hr    1_hr    2_hr    3_hr    4_hr    5_hr    6_hr    7_hr    8_hr    9_hr    10_hr   11_hr   12_hr   13_hr   14_hr   15_hr   16_hr   17_hr   18_hr   19_hr   20_hr   21_hr   22_hr   23_hr
store_id    employee    date                                                                                                
1           1      2015-07-03   00:00:00    00:45:00    00:30:00    01:00:00    00:00:00    00:00:00    0 days  0 days  0 days  0 days  0 days  0 days  0 days  0 days  0 days  0 days  0 days  0 days  0 days  0 days  0 days  0 days  0 days  0 days
            2      2015-07-03   00:15:00    00:20:00    00:45:00    01:00:00    00:40:00    00:00:00    0 days  0 days  0 days  0 days  0 days  0 days  0 days  0 days  0 days  0 days  0 days  0 days  0 days  0 days  0 days  0 days  0 days  0 days
            3      2015-07-03   00:10:00    00:20:00    00:55:00    01:00:00    01:00:00    00:20:00    0 days  0 days  0 days  0 days  0 days  0 days  0 days  0 days  0 days  0 days  0 days  0 days  0 days  0 days  0 days  0 days  0 days  0 days
            4      2015-07-03   00:00:00    00:00:00    00:00:00    00:15:00    01:00:00    01:00:00    0 days  0 days  0 days  0 days  0 days  0 days  0 days  0 days  0 days  0 days  0 days  0 days  0 days  0 days  0 days  0 days  0 days  0 days

有没有更好的方法可以做到这一点,或者更多的Pythonic或Pandas方式以一种简单的方式实现我能够做到的事情!

1 个答案:

答案 0 :(得分:1)

这不是一个完整的答案,而是一个可以使用的构建块。

让我们计算一下开始时间和结束时间戳的分钟数,希望以更加熊猫般的方式:

import pandas as pd

def diff(ts):
    ts[ts < pd.Timedelta(0)] = pd.Timedelta(0) 
    return (ts - ts.shift(1)).fillna(0)

def calculate_time_worked(start, end):
    _range = pd.date_range(start=start.date(), 
                           end=end.date()+pd.Timedelta('1D'), 
                           freq='H')
    base = pd.Series(_range)
    time_worked = diff(base-start) - diff(base-end) 
    time_worked = time_worked.apply(lambda x: x.total_seconds() / 60)
    time_worked.index =  base
    return time_worked[time_worked > 0]


start = pd.Timestamp('2017-06-13 20:11')
end = pd.Timestamp('2017-06-13 22:35')
time_worked = calculate_time_worked(start, end)
assert time_worked.to_dict() == {
    pd.Timestamp('2017-06-13 21:00'): 49.0,
    pd.Timestamp('2017-06-13 22:00'): 60.0,
    pd.Timestamp('2017-06-13 23:00'): 35.0}

有不同的方法可以使用该函数 - 例如,产生元组或(timestamp,time_worked,id,store)的dicts并构建一个工作时间段的平面数据框,然后在连接中重塑所需的格式操作。随意构建此代码,并希望它是有用的。

相关问题