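I'm working with employee time-card data. Each row has a punch start_time and end_time, and the span between them can be anywhere from 0 minutes to 9 hours. For each row, I want the amount of time the employee worked in each hour. I can do it like this: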
import pandas as pd
import numpy as np
pd.options.display.max_columns = 100
store_id = np.repeat(1,10)
employee = [1,2,3,1,2,3,1,2,3,4]
start_time = pd.date_range('2015-07-03', periods=10, freq='25T')
end_time = pd.date_range('2015-07-03', periods = 10,freq = '40T')
df = pd.DataFrame({'store_id':store_id,'employee':employee,'start_time':start_time,'end_time':end_time})
df.head()
employee end_time start_time store_id
0 1 2015-07-03 00:00:00 2015-07-03 00:00:00 1
1 2 2015-07-03 00:40:00 2015-07-03 00:25:00 1
2 3 2015-07-03 01:20:00 2015-07-03 00:50:00 1
3 1 2015-07-03 02:00:00 2015-07-03 01:15:00 1
4 2 2015-07-03 02:40:00 2015-07-03 01:40:00 1
df['date']=df['start_time'].dt.date
def shift_time_in_hr(row):
    # hrs maps each hour of the day (0-23) to the time worked in that hour
    hrs = {h: pd.Timedelta(0) for h in range(24)}
    # Case 1: start and end fall in the same hour, so the whole span goes to that hour
    if row['start_time'].hour == row['end_time'].hour:
        hrs[row['start_time'].hour] = row['end_time'] - row['start_time']
    else:
        hrs_worked = np.arange(row['start_time'].hour, row['end_time'].hour + 1)
        # Case 2: start and end fall in different hours; any full hours in between get 60 minutes
        if len(hrs_worked) > 2:
            for i in range(hrs_worked[0] + 1, hrs_worked[-1]):
                hrs[i] = pd.Timedelta(hours=1)
        # Assign the start_time and end_time minutes to their respective hours
        hrs[hrs_worked[0]] = pd.Timedelta(minutes=60 - row['start_time'].minute)
        #hrs[hrs_worked[0]] = 60-row['start_time'].minute
        hrs[hrs_worked[-1]] = pd.Timedelta(minutes=row['end_time'].minute)
    # Series.append was removed in pandas 2.0; pd.concat does the same job
    hour_cols = pd.Series(list(hrs.values()), index=['{}_hr'.format(i) for i in hrs])
    return pd.concat([row, hour_cols])
df=df.apply(shift_time_in_hr,axis = 1)
df.head()
employee end_time start_time store_id date 0_hr 1_hr 2_hr 3_hr 4_hr 5_hr 6_hr 7_hr 8_hr 9_hr 10_hr 11_hr 12_hr 13_hr 14_hr 15_hr 16_hr 17_hr 18_hr 19_hr 20_hr 21_hr 22_hr 23_hr
0 1 2015-07-03 00:00:00 2015-07-03 00:00:00 1 2015-07-03 00:00:00 00:00:00 00:00:00 0 days 0 days 0 days 0 days 0 days 0 days 0 days 0 days 0 days 0 days 0 days 0 days 0 days 0 days 0 days 0 days 0 days 0 days 0 days 0 days 0 days
1 2 2015-07-03 00:40:00 2015-07-03 00:25:00 1 2015-07-03 00:15:00 00:00:00 00:00:00 0 days 0 days 0 days 0 days 0 days 0 days 0 days 0 days 0 days 0 days 0 days 0 days 0 days 0 days 0 days 0 days 0 days 0 days 0 days 0 days 0 days
2 3 2015-07-03 01:20:00 2015-07-03 00:50:00 1 2015-07-03 00:10:00 00:20:00 00:00:00 0 days 0 days 0 days 0 days 0 days 0 days 0 days 0 days 0 days 0 days 0 days 0 days 0 days 0 days 0 days 0 days 0 days 0 days 0 days 0 days 0 days
3 1 2015-07-03 02:00:00 2015-07-03 01:15:00 1 2015-07-03 00:00:00 00:45:00 00:00:00 0 days 0 days 0 days 0 days 0 days 0 days 0 days 0 days 0 days 0 days 0 days 0 days 0 days 0 days 0 days 0 days 0 days 0 days 0 days 0 days 0 days
4 2 2015-07-03 02:40:00 2015-07-03 01:40:00 1 2015-07-03 00:00:00 00:20:00 00:40:00 0 days 0 days 0 days 0 days 0 days 0 days 0 days 0 days 0 days 0 days 0 days 0 days 0 days 0 days 0 days 0 days 0 days 0 days 0 days 0 days 0 days
dict_agg= dict(zip(df.columns[5:],[np.sum]*24))
df.groupby(['store_id','employee','date']).agg(dict_agg)
Expected output: for each day, the number of minutes each employee worked in each hour.
0_hr 1_hr 2_hr 3_hr 4_hr 5_hr 6_hr 7_hr 8_hr 9_hr 10_hr 11_hr 12_hr 13_hr 14_hr 15_hr 16_hr 17_hr 18_hr 19_hr 20_hr 21_hr 22_hr 23_hr
store_id employee date
1 1 2015-07-03 00:00:00 00:45:00 00:30:00 01:00:00 00:00:00 00:00:00 0 days 0 days 0 days 0 days 0 days 0 days 0 days 0 days 0 days 0 days 0 days 0 days 0 days 0 days 0 days 0 days 0 days 0 days
2 2015-07-03 00:15:00 00:20:00 00:45:00 01:00:00 00:40:00 00:00:00 0 days 0 days 0 days 0 days 0 days 0 days 0 days 0 days 0 days 0 days 0 days 0 days 0 days 0 days 0 days 0 days 0 days 0 days
3 2015-07-03 00:10:00 00:20:00 00:55:00 01:00:00 01:00:00 00:20:00 0 days 0 days 0 days 0 days 0 days 0 days 0 days 0 days 0 days 0 days 0 days 0 days 0 days 0 days 0 days 0 days 0 days 0 days
4 2015-07-03 00:00:00 00:00:00 00:00:00 00:15:00 01:00:00 01:00:00 0 days 0 days 0 days 0 days 0 days 0 days 0 days 0 days 0 days 0 days 0 days 0 days 0 days 0 days 0 days 0 days 0 days 0 days
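Is there a better way to do this, or a more Pythonic or pandas-native approach that gets the same result more simply?
Answer 0 (score: 1)
This is not a complete answer, but rather a building block you can use.
Let's compute the minutes worked in each hour between a start and an end timestamp, hopefully in a more pandas-like way: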
import pandas as pd
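def diff(ts):
    # Clip negative offsets to zero, then take the step between consecutive hourly marks
    ts[ts < pd.Timedelta(0)] = pd.Timedelta(0)
    # Fill the leading NaT with a Timedelta; an integer fill value is not
    # reliable for timedelta data across pandas versions
    return (ts - ts.shift(1)).fillna(pd.Timedelta(0))

def calculate_time_worked(start, end):
    # Hourly marks from midnight of the start date to midnight after the end date
    _range = pd.date_range(start=start.date(),
                           end=end.date() + pd.Timedelta('1D'),
                           freq=pd.Timedelta(hours=1))
    base = pd.Series(_range)
    time_worked = diff(base - start) - diff(base - end)
    # Convert each Timedelta to minutes
    time_worked = time_worked.apply(lambda x: x.total_seconds() / 60)
    time_worked.index = base
    return time_worked[time_worked > 0]

start = pd.Timestamp('2017-06-13 20:11')
end = pd.Timestamp('2017-06-13 22:35')
time_worked = calculate_time_worked(start, end)
# Each key marks the end of an hour, e.g. 21:00 holds the minutes worked from 20:00 to 21:00
assert time_worked.to_dict() == {
    pd.Timestamp('2017-06-13 21:00'): 49.0,
    pd.Timestamp('2017-06-13 22:00'): 60.0,
    pd.Timestamp('2017-06-13 23:00'): 35.0}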
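One way to read the two diff calls in calculate_time_worked: together they amount to clamping every hourly mark into the interval [start, end] and taking the difference between consecutive clamped marks. The sketch below restates that idea with a single clip, working in float minutes. The name minutes_per_hour is hypothetical and not part of the answer, and the code assumes end is not earlier than start.

```python
import pandas as pd

def minutes_per_hour(start, end):
    """Minutes worked in each hour, labelled by the END of that hour."""
    # Hourly marks from midnight of the start date to midnight after the end date.
    edges = pd.date_range(start=start.normalize(),
                          end=end.normalize() + pd.Timedelta(days=1),
                          freq=pd.Timedelta(hours=1))
    # Minutes from `start` to each mark (negative before the shift begins).
    offsets = pd.Series((edges - start).total_seconds() / 60, index=edges)
    # Clamp the offsets into [0, shift length]; consecutive differences are
    # then the minutes worked in the hour ending at each mark.
    shift_minutes = (end - start).total_seconds() / 60
    worked = offsets.clip(lower=0, upper=shift_minutes).diff().dropna()
    return worked[worked > 0]

result = minutes_per_hour(pd.Timestamp('2017-06-13 20:11'),
                          pd.Timestamp('2017-06-13 22:35'))
print(result)
```

For the 20:11 to 22:35 example this gives the same three entries as the assertion above.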
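There are different ways to use calculate_time_worked. For example, you could yield tuples or dicts of (timestamp, time_worked, id, store), build a flat dataframe of time-worked periods, and then reshape it into the desired format with a join operation. Feel free to build on this code; I hope it is useful.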
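As one possible way to get from per-shift minutes to the wide table in the question, here is a minimal sketch that skips the row-wise apply and computes the overlap of every shift with every hour by NumPy broadcasting. It is a different technique from the answer's function, not an extension of it. The helper name minutes_by_hour is hypothetical, the column names are taken from the question, and the sketch assumes each shift ends on the date it starts (any time past midnight is dropped, as in the question's own code).

```python
import numpy as np
import pandas as pd

def minutes_by_hour(df):
    """One row per (store_id, employee, date) with minutes worked in each hour."""
    day = df['start_time'].dt.normalize()
    # Shift start and end as minutes after midnight, shaped (n_rows, 1).
    start_min = ((df['start_time'] - day).dt.total_seconds() / 60).to_numpy()[:, None]
    end_min = ((df['end_time'] - day).dt.total_seconds() / 60).to_numpy()[:, None]
    # Hour buckets [0, 60), [60, 120), ... shaped (24,) so they broadcast per row.
    bucket_start = np.arange(24) * 60
    # Overlap of each shift with each bucket; a negative value means no overlap.
    overlap = np.minimum(end_min, bucket_start + 60) - np.maximum(start_min, bucket_start)
    overlap = np.clip(overlap, 0, None)
    hours = pd.DataFrame(overlap, index=df.index,
                         columns=['{}_hr'.format(h) for h in range(24)])
    keys = df[['store_id', 'employee']].assign(date=day.dt.date)
    return pd.concat([keys, hours], axis=1).groupby(['store_id', 'employee', 'date']).sum()

# Sample data equivalent to the question's (25-minute and 40-minute steps).
df = pd.DataFrame({
    'store_id': 1,
    'employee': [1, 2, 3, 1, 2, 3, 1, 2, 3, 4],
    'start_time': pd.date_range('2015-07-03', periods=10, freq=pd.Timedelta(minutes=25)),
    'end_time': pd.date_range('2015-07-03', periods=10, freq=pd.Timedelta(minutes=40)),
})
result = minutes_by_hour(df)
print(result.iloc[:, :6])
```

The values are float minutes rather than Timedelta objects. If the Timedelta display from the question is preferred, each column can be converted afterwards with pd.to_timedelta(column, unit='m').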