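Grouping and calculating data

Time: 2018-06-18 08:16:12

Tags: python pandas dataframe pandas-groupby

I'm fairly new to Pandas, so I'd like to know whether a certain operation is possible before I start coding.

I have a dataset of employee working times like this (these are sample values; the real data has thousands of records):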

    ID      Name    Date        Hour    Type
0   123     Bob     01/01/2018  09:00   In
1   123     Bob     01/01/2018  09:30   Out
2   123     Bob     01/01/2018  10:00   In
3   123     Bob     01/01/2018  12:00   Out
4   123     Bob     01/01/2018  13:00   In
5   123     Bob     01/01/2018  17:00   Out
6   456     Max     01/01/2018  09:00   In
7   456     Max     01/01/2018  12:00   Out
8   456     Max     01/01/2018  13:00   In
9   456     Max     01/01/2018  17:00   Out
10  123     Bob     02/01/2018  09:00   In
11  123     Bob     02/01/2018  09:30   Out
12  123     Bob     02/01/2018  10:00   In
13  123     Bob     02/01/2018  17:00   Out
14  456     Max     02/01/2018  10:00   In
15  456     Max     02/01/2018  17:00   Out

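I know how powerful Python and Pandas are at handling data, and I'd like to know whether it is possible to get this kind of output without writing explicit loops: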

    ID      Name    Date        HourWorked
0   123     Bob     01/01/2018  06:30
1   456     Max     01/01/2018  07:00
2   123     Bob     02/01/2018  07:30
3   456     Max     02/01/2018  07:00

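In the end, I need to calculate, for each employee ID, the hours and minutes worked on each day.

I've looked at a lot of GroupBy examples, but I haven't found anything useful.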

TIA

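3 Answers:

Answer 0: (score: 4)

Convert Hour to datetime, group by each 'In'/'Out' pair and take the difference. Then group those differences by 'ID', 'Name' and 'Date' and sum them, i.e.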

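# parse the clock times so they can be subtracted
df['Hour'] = pd.to_datetime(df['Hour'])

# every 'In' row starts a new pair; diff() within a pair gives Out - In
df['diff'] = df.groupby((df['Type'] == 'In').cumsum())['Hour'].diff()

# add up the pair durations per employee and day
df_new = df.groupby(['ID','Name','Date'])['diff'].sum().to_frame('Hours Worked')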

                    Hours Worked
ID  Name Date                   
123 Bob  01/01/2018     06:30:00
         02/01/2018     07:30:00
456 Max  01/01/2018     07:00:00
         02/01/2018     07:00:00
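
To make the pairing step easier to follow, here is a minimal self-contained sketch of the same idea on the first day of the question's sample data. The names `pair_id` and `worked` are only illustrative, and the explicit `format="%H:%M"` is an assumption about how the Hour strings are written.

```python
import pandas as pd

# First day of the sample data from the question.
df = pd.DataFrame({
    "ID":   [123, 123, 123, 123, 123, 123, 456, 456, 456, 456],
    "Name": ["Bob"] * 6 + ["Max"] * 4,
    "Date": ["01/01/2018"] * 10,
    "Hour": ["09:00", "09:30", "10:00", "12:00", "13:00", "17:00",
             "09:00", "12:00", "13:00", "17:00"],
    "Type": ["In", "Out", "In", "Out", "In", "Out",
             "In", "Out", "In", "Out"],
})

# Assumption: every Hour value is written as HH:MM.
df["Hour"] = pd.to_datetime(df["Hour"], format="%H:%M")

# Each 'In' row opens a new pair, so a running count of 'In' rows
# gives both rows of a pair the same label: 1, 1, 2, 2, 3, 3, ...
pair_id = (df["Type"] == "In").cumsum()

# Inside a pair, diff() is NaT on the 'In' row and (Out - In) on the 'Out' row.
df["diff"] = df.groupby(pair_id)["Hour"].diff()

# sum() skips the NaT rows, leaving the total time per employee and day.
worked = df.groupby(["ID", "Name", "Date"])["diff"].sum()
print(worked)
```

Like the original snippet, this relies on the rows already being in chronological In/Out order for each employee.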

Answer 1: (score: 2)

Use groupby with a custom function. This assumes your "In" and "Out" times are correctly paired and sorted.

# convert series to timedelta
df['Hour'] = pd.to_timedelta(df['Hour']+':00')

# define total time calculation
def total_time(x):
    return (x.iloc[1::2].values - x.iloc[::2].values).sum()

# apply groupby and convert to dataframe
res = df.groupby(['ID', 'Name', 'Date'])['Hour'].apply(total_time)\
        .to_frame('Hours Worked').reset_index()

print(res)

    ID Name        Date  Hours Worked
0  123  Bob  01/01/2018      06:30:00
1  123  Bob  02/01/2018      07:30:00
2  456  Max  01/01/2018      07:00:00
3  456  Max  02/01/2018      07:00:00
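
The question asks for values such as 06:30 rather than 06:30:00. If that exact display is needed, one possible approach is to turn each timedelta into an HH:MM string afterwards. The helper `format_hhmm` and the column name `HourWorked` below are hypothetical, and the small `res` frame only stands in for the result above.

```python
import pandas as pd

def format_hhmm(td):
    """Render a Timedelta as a zero-padded HH:MM string (hypothetical helper)."""
    total_minutes = int(td.total_seconds() // 60)
    hours, minutes = divmod(total_minutes, 60)
    return f"{hours:02d}:{minutes:02d}"

# Stand-in for the grouped result, with two of its rows.
res = pd.DataFrame({
    "ID": [123, 456],
    "Hours Worked": pd.to_timedelta(["06:30:00", "07:00:00"]),
})

# The new column holds plain strings, so it is for display only.
res["HourWorked"] = res["Hours Worked"].apply(format_hhmm)
print(res)
```

Keeping the timedelta column alongside the string column means further arithmetic (weekly totals, for example) is still possible.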

Answer 2: (score: 0)

Note that this solution assumes your Type values always come in "In, Out" order.

import numpy as np
import pandas as pd

df = pd.DataFrame({"ID": [123,123,123,123,456,456, 123,123, 456,456],
                   "Date": ["01/01/2018","01/01/2018", "01/01/2018", "01/01/2018", "01/01/2018", "01/01/2018", 
                       "02/01/2018", "02/01/2018", "02/01/2018", "02/01/2018"],
                   "Hour": ["09:00","09:30","10:00","12:00","13:00","17:00", "10:00","12:00","13:00","17:00"],
                   "Type": ["In","Out","In","Out","In","Out", "In","Out","In","Out"]})
# combine time and date into one timestamp per row
df["DateTime"] = pd.to_datetime(df["Hour"] + " " + df["Date"])
# per (ID, Date): list the timestamps, take consecutive differences,
# keep every other one (the In -> Out gaps) and sum them
df.groupby(["ID", "Date"])["DateTime"].apply(list).\
                                       apply(lambda x: [x[i+1] - x[i] for i in range(len(x) - 1)]).str[0::2].\
                                       apply(lambda x: np.sum(x))
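
Since the result depends on the In/Out ordering assumption, it may be worth checking it before trusting the totals. The following is only a sketch: `is_alternating` is a hypothetical helper, the tiny frame is made-up test data with one deliberately broken group, and the rows are assumed to be in chronological order already.

```python
import pandas as pd

def is_alternating(types):
    """True if the values read In, Out, In, Out, ... and end on Out."""
    values = list(types)
    expected = ["In", "Out"] * (len(values) // 2)
    return len(values) % 2 == 0 and values == expected

# Made-up data: employee 456 has an 'In' with no matching 'Out'.
df = pd.DataFrame({
    "ID":   [123, 123, 456, 456, 456],
    "Date": ["01/01/2018"] * 5,
    "Type": ["In", "Out", "In", "Out", "In"],
})

# One boolean per (ID, Date) group.
valid = df.groupby(["ID", "Date"])["Type"].apply(is_alternating).astype(bool)

# Groups that break the assumption and need cleaning first.
bad_groups = valid[~valid].index.tolist()
print(bad_groups)
```

Groups listed in `bad_groups` would have to be fixed or excluded before applying any of the pairing approaches in this thread.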