Question

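I'm almost a newbie with Pandas, so I'd like to know whether something is possible before I start coding.

I have a set of data on employees' working hours, like this (these rows are mock data; the real thing is thousands of records)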

    ID      Name    Date        Hour    Type
0   123     Bob     01/01/2018  09:00   In
1   123     Bob     01/01/2018  09:30   Out
2   123     Bob     01/01/2018  10:00   In
3   123     Bob     01/01/2018  12:00   Out
4   123     Bob     01/01/2018  13:00   In
5   123     Bob     01/01/2018  17:00   Out
6   456     Max     01/01/2018  09:00   In
7   456     Max     01/01/2018  12:00   Out
8   456     Max     01/01/2018  13:00   In
9   456     Max     01/01/2018  17:00   Out
10  123     Bob     02/01/2018  09:00   In
11  123     Bob     02/01/2018  09:30   Out
12  123     Bob     02/01/2018  10:00   In
13  123     Bob     02/01/2018  17:00   Out
14  456     Max     02/01/2018  10:00   In
15  456     Max     02/01/2018  17:00   Out

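I know how powerful Python and Pandas are for manipulating data, and I'd like to know whether it's possible to get this kind of output without writing iterative code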

    ID      Name    Date        HourWorked
0   123     Bob     01/01/2018  06:30
1   456     Max     01/01/2018  07:00
2   123     Bob     02/01/2018  07:30
3   456     Max     02/01/2018  07:00

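In the end, I need to calculate, for each employee ID, the hours/minutes worked on each day.

I've looked at a lot of GroupBy examples, but I haven't found anything useful.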

TIA

Answer 1

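Convert the Hour column to datetime, group the rows into In/Out pairs and take the difference within each pair. Then group those differences by 'ID', 'Name' and 'Date' and sum them, i.e.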

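import pandas as pd

# Parse the "HH:MM" strings so they can be subtracted
df['Hour'] = pd.to_datetime(df['Hour'])

# Each 'In' row starts a new pair; diff() within a pair gives Out - In
df['diff'] = df.groupby((df['Type'] == 'In').cumsum())['Hour'].diff()

# Total the pair durations per employee and day
df_new = df.groupby(['ID','Name','Date'])['diff'].sum().to_frame('Hours Worked')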

                    Hours Worked
ID  Name Date                   
123 Bob  01/01/2018     06:30:00
         02/01/2018     07:30:00
456 Max  01/01/2018     07:00:00
         02/01/2018     07:00:00
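
To make the pairing step easier to follow, here is a minimal self-contained sketch of the same approach run on the sample rows from the question. The variable name `pair_id` and the explicit `format="%H:%M"` are my additions for illustration, not part of the answer's code.

```python
import pandas as pd

# Sample rows copied from the question (Bob and Max, two days).
rows = [
    (123, "Bob", "01/01/2018", "09:00", "In"),
    (123, "Bob", "01/01/2018", "09:30", "Out"),
    (123, "Bob", "01/01/2018", "10:00", "In"),
    (123, "Bob", "01/01/2018", "12:00", "Out"),
    (123, "Bob", "01/01/2018", "13:00", "In"),
    (123, "Bob", "01/01/2018", "17:00", "Out"),
    (456, "Max", "01/01/2018", "09:00", "In"),
    (456, "Max", "01/01/2018", "12:00", "Out"),
    (456, "Max", "01/01/2018", "13:00", "In"),
    (456, "Max", "01/01/2018", "17:00", "Out"),
    (123, "Bob", "02/01/2018", "09:00", "In"),
    (123, "Bob", "02/01/2018", "09:30", "Out"),
    (123, "Bob", "02/01/2018", "10:00", "In"),
    (123, "Bob", "02/01/2018", "17:00", "Out"),
    (456, "Max", "02/01/2018", "10:00", "In"),
    (456, "Max", "02/01/2018", "17:00", "Out"),
]
df = pd.DataFrame(rows, columns=["ID", "Name", "Date", "Hour", "Type"])

# Parse "HH:MM" explicitly; the placeholder date part cancels out in the differences.
df["Hour"] = pd.to_datetime(df["Hour"], format="%H:%M")

# Every "In" row starts a new pair, so the running count of "In" rows
# labels each In/Out pair with its own number (assumes rows are ordered In, Out, In, Out...).
pair_id = (df["Type"] == "In").cumsum()

# Within a pair, diff() is NaT on the "In" row and (Out - In) on the "Out" row.
df["diff"] = df.groupby(pair_id)["Hour"].diff()

# Summing skips the NaT entries, leaving the total worked time per employee and day.
df_new = df.groupby(["ID", "Name", "Date"])["diff"].sum().to_frame("Hours Worked")
print(df_new)
```

Like the original, this relies on every "In" row being immediately followed by its matching "Out" row.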

Answer 2

Use groupby with a custom function. This assumes your "In" & "Out" times are correctly paired and sorted.

import pandas as pd

# convert series to timedelta
df['Hour'] = pd.to_timedelta(df['Hour']+':00')

# define total time calculation
def total_time(x):
    return (x.iloc[1::2].values - x.iloc[::2].values).sum()

# apply groupby and convert to dataframe
res = df.groupby(['ID', 'Name', 'Date'])['Hour'].apply(total_time)\
        .to_frame('Hours Worked').reset_index()

print(res)

    ID Name        Date  Hours Worked
0  123  Bob  01/01/2018      06:30:00
1  123  Bob  02/01/2018      07:30:00
2  456  Max  01/01/2018      07:00:00
3  456  Max  02/01/2018      07:00:00
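
Since the result is only right when the pairing assumption holds, it may be worth checking it first on real data. The sketch below is one possible check, not part of the answer: `is_well_paired` is a hypothetical helper, and the small frame is made-up data with one deliberately unmatched "In".

```python
import pandas as pd

# Made-up data: Bob's last "In" has no matching "Out".
df = pd.DataFrame({
    "ID":   [123, 123, 123, 456, 456],
    "Name": ["Bob", "Bob", "Bob", "Max", "Max"],
    "Date": ["01/01/2018"] * 5,
    "Hour": ["09:00", "09:30", "10:00", "09:00", "12:00"],
    "Type": ["In", "Out", "In", "In", "Out"],
})
df["Hour"] = pd.to_timedelta(df["Hour"] + ":00")

def is_well_paired(group):
    """Hypothetical check: rows alternate In, Out, In, Out... and times never go backwards."""
    types = group["Type"].tolist()
    expected = ["In", "Out"] * (len(types) // 2)
    return bool(types == expected and group["Hour"].is_monotonic_increasing)

# One True/False per (ID, Name, Date) group.
ok = (
    df.groupby(["ID", "Name", "Date"])[["Hour", "Type"]]
      .apply(is_well_paired)
      .astype(bool)
)

# Groups that would give a wrong total with the slicing approach.
print(ok[~ok])
```

Groups flagged here can be fixed or dropped before applying `total_time`.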

Answer 3

This solution, however, assumes that your Type values always come in "In-Out" order.

import numpy as np
import pandas as pd

df = pd.DataFrame({"ID": [123,123,123,123,456,456, 123,123, 456,456],
                   "Date": ["01/01/2018","01/01/2018", "01/01/2018", "01/01/2018", "01/01/2018", "01/01/2018", 
                       "02/01/2018", "02/01/2018", "02/01/2018", "02/01/2018"],
                   "Hour": ["09:00","09:30","10:00","12:00","13:00","17:00", "10:00","12:00","13:00","17:00"],
                   "Type": ["In","Out","In","Out","In","Out", "In","Out","In","Out"]})
# Combine the time and date strings into a single timestamp column
df["DateTime"] = pd.to_datetime(df["Hour"] + " " + df["Date"])
# Per (ID, Date): list the timestamps, take consecutive differences,
# keep every other one (the In -> Out gaps) and sum them
df.groupby(["ID", "Date"])["DateTime"].apply(list).\
                                       apply(lambda x: [x[i+1] - x[i] for i in range(len(x) - 1)]).str[0::2].\
                                       apply(lambda x: np.sum(x))
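
The result is a series of timedeltas (such as 06:30:00), while the question asks for HH:MM text. A minimal sketch of one way to convert, assuming the totals are already computed: `format_hhmm` is a hypothetical helper and `totals` is stand-in data, not output from the code above.

```python
import pandas as pd

# Stand-in for per-employee, per-day totals; the last value exceeds 24 hours on purpose.
totals = pd.Series(
    pd.to_timedelta(["06:30:00", "07:00:00", "1 days 01:05:00"]),
    name="HourWorked",
)

def format_hhmm(td):
    """Hypothetical helper: render a Timedelta as HH:MM, letting hours run past 24."""
    minutes = int(td.total_seconds() // 60)
    return f"{minutes // 60:02d}:{minutes % 60:02d}"

formatted = totals.map(format_hhmm)
print(formatted.tolist())
```

Working from total seconds rather than the timedelta's hour component keeps totals longer than a day from wrapping around.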
