My data looks like this:
ID Date_off Approved
1 3/7/2018 1
1 3/8/2018 1
2 3/29/2018 1
2 3/30/2018 1
3 3/30/2018 1
4 4/2/2018 1
5 4/2/2018 1
6 4/2/2018 1
7 4/2/2018 1
8 4/2/2018 1
4 4/3/2018 1
5 4/3/2018 1
6 4/3/2018 1
7 4/3/2018 1
8 4/3/2018 1
4 4/4/2018 1
9 4/4/2018 1
5 4/4/2018 1
10 4/4/2018 1
6 4/4/2018 1
8 4/4/2018 1
9 4/5/2018 1
11 4/5/2018 1
5 4/5/2018 1
10 4/5/2018 1
6 4/5/2018 1
12 4/5/2018 1
8 4/5/2018 1
13 4/10/2018 1
14 4/10/2018 1
15 4/10/2018 1
16 4/10/2018 1
17 4/11/2018 1
13 4/11/2018 1
15 4/11/2018 1
18 4/11/2018 1
16 4/11/2018 1
15 4/12/2018 1
16 4/12/2018 1
16 4/13/2018 1
19 4/16/2018 1
19 4/17/2018 1
20 4/17/2018 1
21 4/18/2018 1
19 4/18/2018 1
20 4/18/2018 1
21 4/19/2018 1
19 4/19/2018 1
20 4/19/2018 1
22 4/20/2018 1
21 4/20/2018 1
20 4/20/2018 1
22 4/23/2018 1
23 4/23/2018 1
22 4/24/2018 1
23 4/24/2018 1
22 4/25/2018 1
23 4/25/2018 1
22 4/26/2018 1
23 4/26/2018 1
14 4/26/2018 1
22 4/27/2018 1
10 4/27/2018 1
23 4/27/2018 1
5 4/27/2018 1
14 4/27/2018 1
8 4/30/2018 1
1 4/30/2018 1
10 4/30/2018 1
23 4/30/2018 1
11 4/30/2018 1
24 4/30/2018 1
25 4/30/2018 1
14 4/30/2018 1
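It lists every day on which someone is on vacation. I need to group by name (the ID column here) and return the number of consecutive days each person is on vacation, together with the first and last day of each stretch. So I'm looking for something like: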
Name Days From To
Bob 2 2017-04-06 2017-04-07
Jimy 2 2017-04-07 2017-04-08
Jimy 1 2017-04-10 2017-04-10
....
Currently, Scott's code returns this:
Approved Date_off Date_off
sum amin amax
ID
1 3 3/7/2018 4/30/2018
2 2 3/29/2018 3/30/2018
3 1 3/30/2018 3/30/2018
4 1 4/2/2018 4/2/2018
4 2 4/3/2018 4/4/2018
5 1 4/2/2018 4/2/2018
5 1 4/3/2018 4/3/2018
5 3 4/4/2018 4/27/2018
6 1 4/2/2018 4/2/2018
6 1 4/3/2018 4/3/2018
6 2 4/4/2018 4/5/2018
7 2 4/2/2018 4/3/2018
8 1 4/2/2018 4/2/2018
8 1 4/3/2018 4/3/2018
8 3 4/4/2018 4/30/2018
9 2 4/4/2018 4/5/2018
10 4 4/4/2018 4/30/2018
11 2 4/5/2018 4/30/2018
12 1 4/5/2018 4/5/2018
13 2 4/10/2018 4/11/2018
14 1 4/10/2018 4/10/2018
14 3 4/26/2018 4/30/2018
15 1 4/10/2018 4/10/2018
15 2 4/11/2018 4/12/2018
16 1 4/10/2018 4/10/2018
16 1 4/11/2018 4/11/2018
16 2 4/12/2018 4/13/2018
17 1 4/11/2018 4/11/2018
18 1 4/11/2018 4/11/2018
19 1 4/16/2018 4/16/2018
19 1 4/17/2018 4/17/2018
19 2 4/18/2018 4/19/2018
20 1 4/17/2018 4/17/2018
20 1 4/18/2018 4/18/2018
20 2 4/19/2018 4/20/2018
21 1 4/18/2018 4/18/2018
21 2 4/19/2018 4/20/2018
22 1 4/20/2018 4/20/2018
22 1 4/23/2018 4/23/2018
22 1 4/24/2018 4/24/2018
22 1 4/25/2018 4/25/2018
22 2 4/26/2018 4/27/2018
23 1 4/23/2018 4/23/2018
23 1 4/24/2018 4/24/2018
23 1 4/25/2018 4/25/2018
23 3 4/26/2018 4/30/2018
24 1 4/30/2018 4/30/2018
25 1 4/30/2018 4/30/2018
I only changed the last line:
df.groupby(['ID','cons']).agg({'Approved':sum, 'Date_off':[np.min, np.max]}).reset_index(level=1,drop=True)
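Without the min/max date columns it returns the same result. Could weekdays and weekends be causing the problem? A Friday and the following Monday should count as adjacent days, with the weekend in between skipped.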
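To make the weekend question concrete, here is a minimal sketch of a grouping that treats Friday and the following Monday as adjacent. It is not Scott's code: the function name `business_day_runs` and the small sample are hypothetical, and it assumes a non-empty frame whose `Date_off` values are already parsed and fall on weekdays only. It uses `np.busday_count`, which counts Monday-to-Friday days in a half-open interval, so a Friday-to-Monday step has a gap of 1.

```python
import numpy as np
import pandas as pd

def business_day_runs(df):
    """Group each ID's days off into runs of adjacent business days.

    Assumes df is non-empty and every Date_off is a weekday (hypothetical helper).
    """
    out = df.sort_values(['ID', 'Date_off']).reset_index(drop=True)
    days = out['Date_off'].to_numpy().astype('datetime64[D]')
    ids = out['ID'].to_numpy()
    # Business days from each row's date to the next row's date
    gap = np.busday_count(days[:-1], days[1:])
    # A row continues the current run if it has the same ID as the previous
    # row and lies exactly one business day after it
    continues = (ids[1:] == ids[:-1]) & (gap == 1)
    # The first row always starts a run; every non-continuing row starts a new one
    out['cons'] = np.concatenate([[True], ~continues]).cumsum()
    return (out.groupby(['ID', 'cons'])
               .agg(Days=('Date_off', 'size'),
                    From=('Date_off', 'min'),
                    To=('Date_off', 'max'))
               .reset_index(level='cons', drop=True)
               .reset_index())

# Rows for IDs 1 and 22, copied from the data above
sample = pd.DataFrame({
    'ID': [1, 1, 1, 22, 22, 22, 22, 22, 22],
    'Date_off': pd.to_datetime(['3/7/2018', '3/8/2018', '4/30/2018',
                                '4/20/2018', '4/23/2018', '4/24/2018',
                                '4/25/2018', '4/26/2018', '4/27/2018']),
})
result = business_day_runs(sample)
print(result)
```

With this rule, ID 22's Friday 4/20 joins the following week and forms one six-day stretch, whereas a plain calendar-day difference would split it at the weekend.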
Answer 0 (score: 1)
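import pandas as pd

# Parse the date strings so that diff() yields Timedeltas
df['Date_off'] = pd.to_datetime(df['Date_off'])

# Gap between each day off and the same ID's previous day off
df['daysdiff'] = df.sort_values('Date_off').groupby('ID')['Date_off'].diff()
# I modified this line
# (group_keys=False keeps the original index so the result aligns with df)
df['cons'] = (df.sort_values('Date_off').groupby('ID', group_keys=False)['daysdiff']
                .apply(lambda x: (x.bfill().dt.days != 1).cumsum()))
print(df.groupby(['ID','cons'])[['Approved','Date_off']]
        .agg({'Approved':'sum','Date_off':['min','max']})
        .reset_index(level=1, drop=True))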
Output:
Approved Date_off
sum min max
ID
1 2 2018-03-07 2018-03-08
1 1 2018-04-30 2018-04-30
2 2 2018-03-29 2018-03-30
3 1 2018-03-30 2018-03-30
4 3 2018-04-02 2018-04-04
5 4 2018-04-02 2018-04-05
5 1 2018-04-27 2018-04-27
6 4 2018-04-02 2018-04-05
7 2 2018-04-02 2018-04-03
8 4 2018-04-02 2018-04-05
8 1 2018-04-30 2018-04-30
9 2 2018-04-04 2018-04-05
10 2 2018-04-04 2018-04-05
10 1 2018-04-27 2018-04-27
10 1 2018-04-30 2018-04-30
11 1 2018-04-05 2018-04-05
11 1 2018-04-30 2018-04-30
12 1 2018-04-05 2018-04-05
13 2 2018-04-10 2018-04-11
14 1 2018-04-10 2018-04-10
14 2 2018-04-26 2018-04-27
14 1 2018-04-30 2018-04-30
15 3 2018-04-10 2018-04-12
16 4 2018-04-10 2018-04-13
17 1 2018-04-11 2018-04-11
18 1 2018-04-11 2018-04-11
19 4 2018-04-16 2018-04-19
20 4 2018-04-17 2018-04-20
21 3 2018-04-18 2018-04-20
22 1 2018-04-20 2018-04-20
22 5 2018-04-23 2018-04-27
23 5 2018-04-23 2018-04-27
23 1 2018-04-30 2018-04-30
24 1 2018-04-30 2018-04-30
25 1 2018-04-30 2018-04-30
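For anyone who wants to try this without the full data set, here is a self-contained sketch of the same calendar-day grouping on a few rows copied from the question. The function name `vacation_runs` is hypothetical, and the sketch swaps in two things that are not in the code above: it flags the start of each run directly with `diff().dt.days.ne(1)` instead of `bfill()`, and it uses named aggregation to get flat `Days`, `From` and `To` columns similar to the layout the question asks for. It assumes the dates are month/day/year strings.

```python
import pandas as pd

def vacation_runs(df):
    """Collapse each ID's days off into runs of consecutive calendar days."""
    out = df.copy()
    out['Date_off'] = pd.to_datetime(out['Date_off'], format='%m/%d/%Y')
    out = out.sort_values(['ID', 'Date_off'])
    # True where a row does not directly follow the same ID's previous day
    # (the first row of each ID has a NaN gap, which also counts as a start)
    new_run = out.groupby('ID')['Date_off'].diff().dt.days.ne(1)
    # Running count of run starts within each ID gives a per-ID run label
    out['cons'] = new_run.astype(int).groupby(out['ID']).cumsum()
    return (out.groupby(['ID', 'cons'])
               .agg(Days=('Approved', 'sum'),
                    From=('Date_off', 'min'),
                    To=('Date_off', 'max'))
               .reset_index(level='cons', drop=True)
               .reset_index())

# Rows for IDs 1, 14 and 22, copied from the question's data
sample = pd.DataFrame({
    'ID': [1, 1, 1, 14, 14, 14, 14, 22, 22, 22, 22, 22, 22],
    'Date_off': ['3/7/2018', '3/8/2018', '4/30/2018',
                 '4/10/2018', '4/26/2018', '4/27/2018', '4/30/2018',
                 '4/20/2018', '4/23/2018', '4/24/2018',
                 '4/25/2018', '4/26/2018', '4/27/2018'],
    'Approved': 1,
})
runs = vacation_runs(sample)
print(runs)
```

The rows for these three IDs should match the corresponding rows of the output above, for example ID 22 split into a single day on 2018-04-20 and a five-day stretch from 2018-04-23 to 2018-04-27.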