The dataset below is a dummy example of the problem I am facing. My data has 3 columns: Date, PlayerName and Score, so each player's score is recorded by date.
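The task is to find the player with the maximum total score (over all observations) among the players that satisfy two conditions, which are spelled out after the yearly aggregation further down.
The data frame looks like this: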
library(data.table)

# Toy date column: two observations per year, 2010 to 2014
date <- as.Date(x = c('2010/01/01', '2010/02/02',
                      '2011/01/01', '2011/02/02',
                      '2012/01/01', '2012/02/02',
                      '2013/01/01', '2013/02/02',
                      '2014/01/01', '2014/02/02'), format = "%Y/%m/%d")
PlayerName <- rep(LETTERS[1:5], each = 10)  # players A to E, 10 rows each
score <- c(100, 150, 270, 300, 400,
           100, 120, 200, 400, 900,
           100,  80, 130,  70, 300,
           100, 120, 230, 650, 870,
           100,  90, 110, 450, 342)
# date and score are shorter than PlayerName and get recycled to 50 rows
df <- data.table(date = date, Name = PlayerName, score = score)
> df
date Name score
1: 2010-01-01 A 100
2: 2010-02-02 A 150
3: 2011-01-01 A 270
4: 2011-02-02 A 300
5: 2012-01-01 A 400
6: 2012-02-02 A 100
7: 2013-01-01 A 120
8: 2013-02-02 A 200
9: 2014-01-01 A 400
10: 2014-02-02 A 900
11: 2010-01-01 B 100
12: 2010-02-02 B 80
13: 2011-01-01 B 130
14: 2011-02-02 B 70
15: 2012-01-01 B 300
16: 2012-02-02 B 100
17: 2013-01-01 B 120
18: 2013-02-02 B 230
19: 2014-01-01 B 650
20: 2014-02-02 B 870
21: 2010-01-01 C 100
22: 2010-02-02 C 90
23: 2011-01-01 C 110
24: 2011-02-02 C 450
25: 2012-01-01 C 342
26: 2012-02-02 C 100
27: 2013-01-01 C 150
28: 2013-02-02 C 270
29: 2014-01-01 C 300
30: 2014-02-02 C 400
31: 2010-01-01 D 100
32: 2010-02-02 D 120
33: 2011-01-01 D 200
34: 2011-02-02 D 400
35: 2012-01-01 D 900
36: 2012-02-02 D 100
37: 2013-01-01 D 80
38: 2013-02-02 D 130
39: 2014-01-01 D 70
40: 2014-02-02 D 300
41: 2010-01-01 E 100
42: 2010-02-02 E 120
43: 2011-01-01 E 230
44: 2011-02-02 E 650
45: 2012-01-01 E 870
46: 2012-02-02 E 100
47: 2013-01-01 E 90
48: 2013-02-02 E 110
49: 2014-01-01 E 450
50: 2014-02-02 E 342
What I have done so far is the following:
df[,year := lubridate::year(date)] # extract the year
df1 <- df[,.(total_score =sum(score)),.(Name,year)] # Yearly Aggregated Scores
df1[,total_score_lag := shift(x=total_score,type = 'lag'),.(Name)] ## creates a players lagged column of score
df1[,growth_rate := round(total_score/total_score_lag,2)] ## creates ratio of current and past years scores column
df1[,growth_rate_lag := shift(x=growth_rate,type = 'lag'),.(Name)] #### Creates a lag column of growth column
> df1
Name year total_score total_score_lag growth_rate growth_rate_lag
1: A 2010 100 NA NA NA
2: A 2011 150 100 1.50 NA
3: A 2012 270 150 1.80 1.50
4: A 2013 300 270 1.11 1.80
5: A 2014 400 300 1.33 1.11
6: B 2010 100 NA NA NA
7: B 2011 120 100 1.20 NA
8: B 2012 200 120 1.67 1.20
9: B 2013 400 200 2.00 1.67
10: B 2014 900 400 2.25 2.00
11: C 2010 100 NA NA NA
12: C 2011 80 100 0.80 NA
13: C 2012 130 80 1.62 0.80
14: C 2013 70 130 0.54 1.62
15: C 2014 300 70 4.29 0.54
16: D 2010 100 NA NA NA
17: D 2011 120 100 1.20 NA
18: D 2012 230 120 1.92 1.20
19: D 2013 650 230 2.83 1.92
20: D 2014 870 650 1.34 2.83
21: E 2010 100 NA NA NA
22: E 2011 90 100 0.90 NA
23: E 2012 110 90 1.22 0.90
24: E 2013 450 110 4.09 1.22
25: E 2014 342 450 0.76 4.09
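Now I know that I need to verify two conditions:
1. The growth_rate column is always greater than 1 for a given player.
2. For a given player, each successive value in the growth_rate_lag column is greater than the value in the previous row.
However, I have not been able to code this logic. There may also be another way to approach it. I would appreciate any help. Thanks in advance.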
Edit 1: The example I used was inaccurate. An updated example follows:
date <- as.Date(x = c('2010/01/01','2010/02/02',
'2011/01/01','2011/02/02',
'2012/01/01','2012/02/02',
'2013/01/01','2013/02/02',
'2014/01/01','2014/02/02'),format = "%Y/%m/%d")
PlayerName <- rep(LETTERS[1:5],each=10) # Name Players as A:E
score <- c(40,60,100,50,70,200,120,180,380,20,
40,60,20,100,150,50,300,100,800,100,
10,90,30,50,100,30,10,60,100,200,
50,50,100,20,200,30,400,60,570,400,
80,20,70,20,100,10,400,50,142,200)
df <- data.table(date=date,Name=PlayerName,score=score) # the vector is called PlayerName; Name was undefined
df[,year := lubridate::year(date)] # extract the year
df1 <- df[,.(total_score =sum(score)),.(Name,year)] # Yearly Aggregated Scores
df1[,total_score_lag := shift(x=total_score,type = 'lag'),.(Name)] ## creates a players lagged column of score
df1[,growth_rate := round(total_score/total_score_lag,2)] ## creates ratio of current and past years scores column
df1[,growth_rate_lag := shift(x=growth_rate,type = 'lag'),.(Name)] #### Creates a lag column of growth column
Name year total_score total_score_lag growth_rate growth_rate_lag
1: A 2010 100 NA NA NA
2: A 2011 150 100 1.50 NA
3: A 2012 270 150 1.80 1.50
4: A 2013 300 270 1.11 1.80
5: A 2014 400 300 1.33 1.11
6: B 2010 100 NA NA NA
7: B 2011 120 100 1.20 NA
8: B 2012 200 120 1.67 1.20
9: B 2013 400 200 2.00 1.67
10: B 2014 900 400 2.25 2.00
11: C 2010 100 NA NA NA
12: C 2011 80 100 0.80 NA
13: C 2012 130 80 1.62 0.80
14: C 2013 70 130 0.54 1.62
15: C 2014 300 70 4.29 0.54
16: D 2010 100 NA NA NA
17: D 2011 120 100 1.20 NA
18: D 2012 230 120 1.92 1.20
19: D 2013 460 230 2.00 1.92
20: D 2014 970 460 2.11 2.00
21: E 2010 100 NA NA NA
22: E 2011 90 100 0.90 NA
23: E 2012 110 90 1.22 0.90
24: E 2013 450 110 4.09 1.22
25: E 2014 342 450 0.76 4.09
Now, clearly, players A, B and D satisfy condition 1, but only B and D satisfy condition 2. Since D has the highest total_score, the answer is D.
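The reasoning above can be checked with a short data.table sketch. It is only one possible way to encode the two conditions, not necessarily what any of the answers below intend. The yearly totals are copied from the updated df1 table, the growth ratio is left unrounded, and the names summary_dt, always_growing, accelerating, qualified and winner are made up for this illustration.

```r
library(data.table)

# Yearly totals copied from the updated df1 table in the question
df1 <- data.table(
  Name = rep(LETTERS[1:5], each = 5),
  year = rep(2010:2014, times = 5),
  total_score = c(100, 150, 270, 300, 400,   # A
                  100, 120, 200, 400, 900,   # B
                  100,  80, 130,  70, 300,   # C
                  100, 120, 230, 460, 970,   # D
                  100,  90, 110, 450, 342)   # E
)
setorder(df1, Name, year)

# Year-over-year ratio per player; the first year has no predecessor, so it is NA
df1[, growth_rate := total_score / shift(total_score), by = Name]

# One row per player: condition 1, condition 2 and the overall total
summary_dt <- df1[, .(
  always_growing = all(growth_rate > 1, na.rm = TRUE),        # condition 1
  accelerating   = all(diff(growth_rate) > 0, na.rm = TRUE),  # condition 2
  total          = sum(total_score)
), by = Name]

# Keep players meeting both conditions, then take the largest total
qualified <- summary_dt[always_growing & accelerating]
winner <- qualified[which.max(total), Name]

print(summary_dt)
print(winner)  # "D"
```

Using all(diff(growth_rate) > 0) avoids the second lag column entirely, because diff() already compares each ratio with the one before it.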
Answer 0 (score: 0)
Do you need something like this?
library(dplyr)

df %>% group_by(Name) %>%
  mutate(grth = (score - lag(score)) / lag(score),         # relative growth per observation
         grth_grth = (grth - lag(grth)) / lag(grth)) %>%   # relative change of that growth
  filter(min(grth, na.rm = TRUE) > 0, min(grth_grth, na.rm = TRUE) > 0) %>%
  summarise(score = sum(score))
# A tibble: 0 x 2
# ... with 2 variables: Name <chr>, score <dbl>
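This means that no player meets the conditions.
Answer 1 (score: 0)
I believe the sample data you provided is not a very good example. That said, here is a possible solution with dplyr (I am not familiar with data.table):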
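library(dplyr)

# Assign the result so that the filter step below can see the new columns
data <- data %>%
  group_by(PlayerName) %>%
  mutate(steady_growth = identical(score, sort(score)),
         positive_growth_rate = ifelse(is.na(lag(score)), TRUE,
                                       score / lag(score) >= 1)) %>%
  ungroup()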
This creates two additional logical columns. You can then filter the desired subset:
data%>%filter(steady_growth & positive_growth_rate)
On your example this gives a zero-row data.frame.
All in one go:
data %>%
  group_by(PlayerName) %>%
  mutate(steady_growth = identical(score, sort(score)),
         positive_growth_rate = ifelse(is.na(lag(score)), TRUE,
                                       score / lag(score) >= 1)) %>%
  filter(steady_growth & positive_growth_rate)
Note that, for a given player, the steady_growth column is either all TRUE or all FALSE.
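The reason is that identical() returns a single logical value for the whole vector, which mutate() then repeats across the player's rows. The base-R sketch below illustrates this with the yearly totals of players B and C from the question's updated table. The helper is_sorted and the tied vector are made up for this illustration.

```r
# Hypothetical helper wrapping the check used in the answer
is_sorted <- function(score) identical(score, sort(score))

b_scores <- c(100, 120, 200, 400, 900)  # player B, yearly totals from the question
c_scores <- c(100,  80, 130,  70, 300)  # player C, yearly totals from the question
tied     <- c(100, 100, 200)            # made-up vector containing a tie

print(is_sorted(b_scores))   # TRUE: already in ascending order
print(is_sorted(c_scores))   # FALSE: 80 follows 100
print(is_sorted(tied))       # TRUE: ties still count as sorted
print(all(diff(tied) > 0))   # FALSE: a strict check rejects the tie
```

If strictly increasing scores are required, all(diff(score) > 0) may be the safer test, since the sort-based comparison accepts repeated values.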
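Answer 2 (score: 0)
With data.table, you can use cumsum to select each player's rows up to the last year in which the growth rate of their score was still increasing: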
df1[,selected :=cumsum(fifelse(growth_rate>growth_rate_lag|is.na(growth_rate_lag),1L,NA_integer_)),by=Name]
df1[selected>0]
Name year total_score total_score_lag growth_rate growth_rate_lag selected
1: A 2010 250 NA NA NA 1
2: A 2011 570 250 2.28 NA 2
3: B 2010 180 NA NA NA 1
4: B 2011 200 180 1.11 NA 2
5: B 2012 400 200 2.00 1.11 3
6: C 2010 190 NA NA NA 1
7: C 2011 560 190 2.95 NA 2
8: D 2010 220 NA NA NA 1
9: D 2011 600 220 2.73 NA 2
10: E 2010 220 NA NA NA 1
11: E 2011 880 220 4.00 NA 2
As pointed out in the other answers, no player in this dataset achieves an increasing growth rate.
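The selection in this answer relies on how cumsum() handles NA: once an NA appears, that element and every later element of the running sum are NA as well. A minimal base-R sketch, using a made-up flag vector rather than the question's data:

```r
# 1 = condition met in that year, NA = condition failed (made-up values)
flags <- c(1L, 1L, NA_integer_, 1L, 1L)

# The first NA poisons the running sum from that position onwards
running <- cumsum(flags)
print(running)  # 1 2 NA NA NA

# which() drops the NA comparisons, so only the years before the first failure remain
keep <- which(running > 0)
print(keep)     # 1 2
```

In the data.table call above, df1[selected>0] has the same effect without which(), because data.table treats NA in a logical i as FALSE.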