Filter an R data.frame column for an increasing growth rate, ordered by another column

Time: 2021-04-24 15:56:55

Tags: r dataframe time-series data.table lag

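The following dataset is a dummy example of the problem I am facing. My data has 3 columns: Date, PlayerName and Score, so each player's score is recorded per date. The task is to find, among the subset of players who satisfy both conditions below, the player with the highest total score (across all observations):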

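  1. The player's yearly performance should grow steadily (each year's total score should be greater than the previous year's total).
  2. The growth rate of the performance should also increase (the year-over-year growth rate of the total score should itself rise over time).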
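For a single player, the two conditions above can be sketched in base R as follows. This is only a minimal illustration: the vector `totals` holds hypothetical yearly totals, and the names `growth`, `steady` and `accelerating` are not from the question.

```r
# Illustrative yearly totals for one player (hypothetical values)
totals <- c(100, 120, 200, 400, 900)

# Year-over-year growth ratio: total in year t divided by total in year t-1
growth <- totals[-1] / totals[-length(totals)]

# Condition 1: every yearly total exceeds the previous one
steady <- all(growth > 1)

# Condition 2: every growth ratio exceeds the previous ratio
accelerating <- all(diff(growth) > 0)

print(c(steady = steady, accelerating = accelerating))
```

Applying the same two checks per player, and then comparing overall totals among the players that pass, gives the requested result.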

The data frame looks like:

library(data.table)

date <- as.Date(x = c('2010/01/01','2010/02/02',
                      '2011/01/01','2011/02/02',
                      '2012/01/01','2012/02/02',
                      '2013/01/01','2013/02/02',
                      '2014/01/01','2014/02/02'),format = "%Y/%m/%d") #toy date column

PlayerName  <- rep(LETTERS[1:5],each=10) # Name Players as A:E
score <- c(100,150,270,300,400,
           100,120,200,400,900,
           100,80,130,70,300,
           100,120,230,650,870,
           100,90,110,450,342)
# date (10 values) and score (25 values) are shorter than PlayerName (50 values)
# and are recycled to 50 rows, as the printout below shows
df <- data.table(date=date,Name=PlayerName,score=score)

> df
          date Name score
 1: 2010-01-01    A   100
 2: 2010-02-02    A   150
 3: 2011-01-01    A   270
 4: 2011-02-02    A   300
 5: 2012-01-01    A   400
 6: 2012-02-02    A   100
 7: 2013-01-01    A   120
 8: 2013-02-02    A   200
 9: 2014-01-01    A   400
10: 2014-02-02    A   900
11: 2010-01-01    B   100
12: 2010-02-02    B    80
13: 2011-01-01    B   130
14: 2011-02-02    B    70
15: 2012-01-01    B   300
16: 2012-02-02    B   100
17: 2013-01-01    B   120
18: 2013-02-02    B   230
19: 2014-01-01    B   650
20: 2014-02-02    B   870
21: 2010-01-01    C   100
22: 2010-02-02    C    90
23: 2011-01-01    C   110
24: 2011-02-02    C   450
25: 2012-01-01    C   342
26: 2012-02-02    C   100
27: 2013-01-01    C   150
28: 2013-02-02    C   270
29: 2014-01-01    C   300
30: 2014-02-02    C   400
31: 2010-01-01    D   100
32: 2010-02-02    D   120
33: 2011-01-01    D   200
34: 2011-02-02    D   400
35: 2012-01-01    D   900
36: 2012-02-02    D   100
37: 2013-01-01    D    80
38: 2013-02-02    D   130
39: 2014-01-01    D    70
40: 2014-02-02    D   300
41: 2010-01-01    E   100
42: 2010-02-02    E   120
43: 2011-01-01    E   230
44: 2011-02-02    E   650
45: 2012-01-01    E   870
46: 2012-02-02    E   100
47: 2013-01-01    E    90
48: 2013-02-02    E   110
49: 2014-01-01    E   450
50: 2014-02-02    E   342

What I have done so far is the following:

df[,year := lubridate::year(date)]  # extract the year 

df1 <- df[,.(total_score =sum(score)),.(Name,year)]  # Yearly Aggregated Scores

df1[,total_score_lag := shift(x=total_score,type = 'lag'),.(Name)]  ## creates a players lagged column of score
df1[,growth_rate := round(total_score/total_score_lag,2)]  ## creates ratio of current and past years scores column
df1[,growth_rate_lag := shift(x=growth_rate,type = 'lag'),.(Name)]  #### Creates a lag column of growth column

> df1
    Name year total_score total_score_lag growth_rate growth_rate_lag
 1:    A 2010         100              NA          NA              NA
 2:    A 2011         150             100        1.50              NA
 3:    A 2012         270             150        1.80            1.50
 4:    A 2013         300             270        1.11            1.80
 5:    A 2014         400             300        1.33            1.11
 6:    B 2010         100              NA          NA              NA
 7:    B 2011         120             100        1.20              NA
 8:    B 2012         200             120        1.67            1.20
 9:    B 2013         400             200        2.00            1.67
10:    B 2014         900             400        2.25            2.00
11:    C 2010         100              NA          NA              NA
12:    C 2011          80             100        0.80              NA
13:    C 2012         130              80        1.62            0.80
14:    C 2013          70             130        0.54            1.62
15:    C 2014         300              70        4.29            0.54
16:    D 2010         100              NA          NA              NA
17:    D 2011         120             100        1.20              NA
18:    D 2012         230             120        1.92            1.20
19:    D 2013         650             230        2.83            1.92
20:    D 2014         870             650        1.34            2.83
21:    E 2010         100              NA          NA              NA
22:    E 2011          90             100        0.90              NA
23:    E 2012         110              90        1.22            0.90
24:    E 2013         450             110        4.09            1.22
25:    E 2014         342             450        0.76            4.09

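Now I know that I need to verify the two conditions by:

  • Filtering the growth_rate column, player-wise, for values that are always greater than 1.
  • Filtering the growth_rate_lag column for players whose value in each consecutive row is greater than in the previous row.

But I am unable to code the above logic. There may also be another way to look at it. I would appreciate any help. Thanks in advance.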

Edit 1: The example I used was not accurate. An updated example looks like this:

library(data.table)

date <- as.Date(x = c('2010/01/01','2010/02/02',
                      '2011/01/01','2011/02/02',
                      '2012/01/01','2012/02/02',
                      '2013/01/01','2013/02/02',
                      '2014/01/01','2014/02/02'),format = "%Y/%m/%d")

PlayerName  <- rep(LETTERS[1:5],each=10) # Name Players as A:E
score <- c(40,60,100,50,70,200,120,180,380,20,
           40,60,20,100,150,50,300,100,800,100,
           10,90,30,50,100,30,10,60,100,200,
           50,50,100,20,200,30,400,60,570,400,
           80,20,70,20,100,10,400,50,142,200)
# date (10 values) is recycled across the 50 rows
df <- data.table(date=date,Name=PlayerName,score=score)
df[,year := lubridate::year(date)]  # extract the year 

df1 <- df[,.(total_score =sum(score)),.(Name,year)]  # Yearly Aggregated Scores

df1[,total_score_lag := shift(x=total_score,type = 'lag'),.(Name)]  ## creates a players lagged column of score
df1[,growth_rate := round(total_score/total_score_lag,2)]  ## creates ratio of current and past years scores column
df1[,growth_rate_lag := shift(x=growth_rate,type = 'lag'),.(Name)]  #### Creates a lag column of growth column

    Name year total_score total_score_lag growth_rate growth_rate_lag
 1:    A 2010         100              NA          NA              NA
 2:    A 2011         150             100        1.50              NA
 3:    A 2012         270             150        1.80            1.50
 4:    A 2013         300             270        1.11            1.80
 5:    A 2014         400             300        1.33            1.11
 6:    B 2010         100              NA          NA              NA
 7:    B 2011         120             100        1.20              NA
 8:    B 2012         200             120        1.67            1.20
 9:    B 2013         400             200        2.00            1.67
10:    B 2014         900             400        2.25            2.00
11:    C 2010         100              NA          NA              NA
12:    C 2011          80             100        0.80              NA
13:    C 2012         130              80        1.62            0.80
14:    C 2013          70             130        0.54            1.62
15:    C 2014         300              70        4.29            0.54
16:    D 2010         100              NA          NA              NA
17:    D 2011         120             100        1.20              NA
18:    D 2012         230             120        1.92            1.20
19:    D 2013         460             230        2.00            1.92
20:    D 2014         970             460        2.11            2.00
21:    E 2010         100              NA          NA              NA
22:    E 2011          90             100        0.90              NA
23:    E 2012         110              90        1.22            0.90
24:    E 2013         450             110        4.09            1.22
25:    E 2014         342             450        0.76            4.09
    

Now players A, B and D clearly satisfy condition 1, but only B and D satisfy condition 2. Since D has the highest total_score, the answer is D.
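The reasoning above can be checked with a short data.table sketch. It assumes the yearly totals shown in the updated table and rebuilds them directly, so it does not depend on the earlier objects. The names `yearly`, `summary_dt`, `steady`, `accelerating`, `overall` and `winner` are illustrative only.

```r
library(data.table)

# Yearly totals copied from the updated table (one row per player and year)
yearly <- data.table(
  Name = rep(LETTERS[1:5], each = 5),
  year = rep(2010:2014, times = 5),
  total_score = c(100, 150, 270, 300, 400,
                  100, 120, 200, 400, 900,
                  100,  80, 130,  70, 300,
                  100, 120, 230, 460, 970,
                  100,  90, 110, 450, 342)
)
setorder(yearly, Name, year)  # the checks below rely on chronological order

# One summary row per player: both conditions plus the overall total
summary_dt <- yearly[, {
  g <- total_score[-1] / total_score[-.N]  # year-over-year growth ratios
  .(steady       = all(g > 1),             # condition 1
    accelerating = all(diff(g) > 0),       # condition 2
    overall      = sum(total_score))
}, by = Name]

# Among the players passing both conditions, take the highest overall total
winner <- summary_dt[steady & accelerating][which.max(overall), Name]
print(summary_dt)
print(winner)
```

The sketch compares unrounded ratios. Rounding growth_rate to two decimals, as in the code above, could turn a small increase into a tie, so it is worth deciding whether ties should count as growth.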

3 Answers:

Answer 0: (score: 0)

Do you need something like this?

library(dplyr)

df %>% group_by(Name) %>%
  mutate(grth = (score - lag(score))/lag(score),
         grth_grth = (grth - lag(grth))/lag(grth)) %>%
  filter(min(grth, na.rm = TRUE) > 0, min(grth_grth, na.rm = TRUE) > 0) %>%
  summarise(score = sum(score))

# A tibble: 0 x 2
# ... with 2 variables: Name <chr>, score <dbl>

This means that no player meets the conditions.
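The code in this answer computes growth between individual rows rather than between yearly totals. A possible variant that aggregates by year first is sketched below. It assumes the updated data from Edit 1 of the question. The column names `grth_chg`, `steady`, `accelerating` and `overall` are illustrative, and this is a sketch rather than the answerer's method.

```r
library(dplyr)

# Rebuild the updated example data from Edit 1 (50 rows, 5 players)
date <- as.Date(c('2010/01/01','2010/02/02','2011/01/01','2011/02/02',
                  '2012/01/01','2012/02/02','2013/01/01','2013/02/02',
                  '2014/01/01','2014/02/02'), format = "%Y/%m/%d")
df <- data.frame(
  date  = rep(date, times = 5),
  Name  = rep(LETTERS[1:5], each = 10),
  score = c(40,60,100,50,70,200,120,180,380,20,
            40,60,20,100,150,50,300,100,800,100,
            10,90,30,50,100,30,10,60,100,200,
            50,50,100,20,200,30,400,60,570,400,
            80,20,70,20,100,10,400,50,142,200)
)

result <- df %>%
  mutate(year = as.integer(format(date, "%Y"))) %>%
  group_by(Name, year) %>%
  summarise(total_score = sum(score), .groups = "drop") %>%  # yearly totals
  arrange(Name, year) %>%
  group_by(Name) %>%
  mutate(grth     = total_score / dplyr::lag(total_score),   # growth ratio
         grth_chg = grth - dplyr::lag(grth)) %>%             # change in ratio
  summarise(steady       = all(grth > 1, na.rm = TRUE),      # condition 1
            accelerating = all(grth_chg > 0, na.rm = TRUE),  # condition 2
            overall      = sum(total_score)) %>%
  filter(steady, accelerating) %>%
  slice_max(overall, n = 1)

print(result)
```

With the updated data this keeps players B and D after the filter and returns D, which matches the result stated in the question.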

Answer 1: (score: 0)

I believe the sample data you provided is not a very good example. That said, here is a possible solution with dplyr (I am not familiar with data.table):

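library(dplyr)

# df is the data frame from the question; its player column is called Name
data <- df %>%
  group_by(Name) %>%
  mutate(steady_growth = identical(score, sort(score)),
         positive_growth_rate = ifelse(is.na(lag(score)), TRUE,
                                       score / lag(score) >= 1)) %>%
  ungroup()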

This creates two additional logical columns. You can then filter for the desired subset:

data%>%filter(steady_growth & positive_growth_rate)

With your example, this gives a data.frame with zero rows.

In a single pipeline:

df %>%
  group_by(Name) %>%
  mutate(steady_growth = identical(score, sort(score)),
         positive_growth_rate = ifelse(is.na(lag(score)), TRUE,
                                       score / lag(score) >= 1)) %>%
  filter(steady_growth & positive_growth_rate)

Note that for a given player, the steady_growth column is either all TRUE or all FALSE.
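One detail worth checking in the approach above is that identical(score, sort(score)) also accepts ties, while the question asks for each value to be strictly greater than the previous one. The small base R sketch below shows the difference on a hypothetical vector `x_ties`.

```r
# Hypothetical scores with one repeated value
x_ties <- c(100, 120, 120, 200)

# TRUE: the vector is already in non-decreasing order, ties included
sorted_check <- identical(x_ties, sort(x_ties))

# FALSE: the step from 120 to 120 is not a strict increase
strict_check <- all(diff(x_ties) > 0)

print(c(sorted_check = sorted_check, strict_check = strict_check))
```

If ties should not count as growth, all(diff(score) > 0) may be the safer test.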

Answer 2: (score: 0)

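With data.table, you can use cumsum to keep, for each player, the years up to the last one in which the player still achieved a higher score growth rate: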

df1[,selected :=cumsum(fifelse(growth_rate>growth_rate_lag|is.na(growth_rate_lag),1L,NA_integer_)),by=Name]
df1[selected>0]

    Name year total_score total_score_lag growth_rate growth_rate_lag selected
 1:    A 2010         250              NA          NA              NA        1
 2:    A 2011         570             250        2.28              NA        2
 3:    B 2010         180              NA          NA              NA        1
 4:    B 2011         200             180        1.11              NA        2
 5:    B 2012         400             200        2.00            1.11        3
 6:    C 2010         190              NA          NA              NA        1
 7:    C 2011         560             190        2.95              NA        2
 8:    D 2010         220              NA          NA              NA        1
 9:    D 2011         600             220        2.73              NA        2
10:    E 2010         220              NA          NA              NA        1
11:    E 2011         880             220        4.00              NA        2

As pointed out in the other answers, no player in this dataset achieves an increasing growth rate.
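The selection relies on cumsum propagating NA: once one year fails the test, every later value of the running count is NA as well, so only the leading run of passing years is kept. A minimal base R sketch with a hypothetical vector `passes` illustrates this:

```r
# Hypothetical per-year test results for one player
passes <- c(TRUE, TRUE, FALSE, TRUE)

# 1 for a passing year, NA for a failing one; cumsum stays NA after the first NA
selected <- cumsum(ifelse(passes, 1L, NA_integer_))
print(selected)

# Indices of the years kept by a filter such as selected > 0
kept <- which(!is.na(selected))

# TRUE only if the player passes in every year
passes_all_years <- !anyNA(selected)
```

To apply the question's conditions to a player's whole history rather than to a leading run, a per-player check such as !anyNA(selected) could be used instead of the row filter.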