Expanding rows by a rolling sum

Time: 2016-04-14 21:13:47

Tags: r dplyr

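I am running an analysis in which I calculate a rate that is based in part on a rolling number of days. I am doing this calculation with dplyr, using group_by / summarise / mutate operations.

However, the increment of the rolling sum varies by group. Ideally I would have a measurement every 30 days, but sometimes the interval between measurements is 60 or 90 days.

For example: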

df <- data.frame( ID = "Subject A",
                 cumulative_days = c(30, 60, 90, 180, 270, 360),
                 rolling_percent = c(.8, .6, .6, .4, .3, .2))

I would like to transform this group into:

result <- data.frame(ID = "Subject A",
                     month = seq(1, 12),
                     rolling_percent = c(.8, .6, .6, NA, NA, .4, NA, NA, .3, NA, NA, .2))

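If I can get to the 'result' data frame above, my plan is to use the dplyr / zoo solution described here: fill in NA based on the last non-NA value for each group in R

That would let me fill in each NA with the last non-NA observation.

In other words, I would like to expand N observations into 12 observations, one per 30-day step, covering the full 360 cumulative days. At that point, I believe I can apply the other linked solution to solve my problem.

I am having a hard time describing this situation clearly, so any suggestions for clarifying my question are appreciated.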
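
The fill step described above can be sketched as follows. This is a minimal sketch, assuming the zoo package's `na.locf()` is the function the linked solution relies on; the name `filled` is only illustrative and does not appear in the question.

```r
library(dplyr)
library(zoo)

# The target 'result' data frame from the question
result <- data.frame(ID = "Subject A",
                     month = seq(1, 12),
                     rolling_percent = c(.8, .6, .6, NA, NA, .4, NA, NA, .3, NA, NA, .2))

# Carry the last non-NA value forward within each ID.
# na.rm = FALSE keeps any leading NAs so the vector length is unchanged.
filled <- result %>%
  group_by(ID) %>%
  mutate(rolling_percent = na.locf(rolling_percent, na.rm = FALSE)) %>%
  ungroup()
```

Grouping by ID matters once there are several subjects, because it stops the last value of one subject from leaking into the first rows of the next.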

3 Answers:

Answer 0: (score: 2)

library(data.table)
dt = as.data.table(df) # or setDT to convert in place

dt[, .(ID, month = cumulative_days/30, rolling_percent)][
   CJ(ID = unique(ID), month = 1:12), on = c('ID', 'month')]
#           ID month rolling_percent
# 1: Subject A     1             0.8
# 2: Subject A     2             0.6
# 3: Subject A     3             0.6
# 4: Subject A     4              NA
# 5: Subject A     5              NA
# 6: Subject A     6             0.4
# 7: Subject A     7              NA
# 8: Subject A     8              NA
# 9: Subject A     9             0.3
#10: Subject A    10              NA
#11: Subject A    11              NA
#12: Subject A    12             0.2

# or simply make it a rolling join to achieve your desired final result
dt[, .(ID, month = cumulative_days/30, rolling_percent)][
   CJ(ID = unique(ID), month = 1:12), on = c('ID', 'month'), roll = TRUE]
#           ID month rolling_percent
# 1: Subject A     1             0.8
# 2: Subject A     2             0.6
# 3: Subject A     3             0.6
# 4: Subject A     4             0.6
# 5: Subject A     5             0.6
# 6: Subject A     6             0.4
# 7: Subject A     7             0.4
# 8: Subject A     8             0.4
# 9: Subject A     9             0.3
#10: Subject A    10             0.3
#11: Subject A    11             0.3
#12: Subject A    12             0.2

Instead of selecting columns as above, you can also simply add a new month column:

dt[, month := cumulative_days/30][
   CJ(ID = unique(ID), month = 1:12), on = c('ID', 'month'), roll = TRUE]
#           ID cumulative_days rolling_percent month
# 1: Subject A              30             0.8     1
# 2: Subject A              60             0.6     2
# 3: Subject A              90             0.6     3
# 4: Subject A              90             0.6     4
# 5: Subject A              90             0.6     5
# 6: Subject A             180             0.4     6
# 7: Subject A             180             0.4     7
# 8: Subject A             180             0.4     8
# 9: Subject A             270             0.3     9
#10: Subject A             270             0.3    10
#11: Subject A             270             0.3    11
#12: Subject A             360             0.2    12
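
To make the rolling join in this answer easier to follow, here is a rough base R sketch of the same idea for a single subject: for each target month, pick the most recent observed month at or before it. It uses `findInterval()` rather than data.table, so it is an illustration of the lookup and not what data.table does internally. The names `obs_month`, `target`, `idx` and `rolled` are made up for this sketch.

```r
df <- data.frame(ID = "Subject A",
                 cumulative_days = c(30, 60, 90, 180, 270, 360),
                 rolling_percent = c(.8, .6, .6, .4, .3, .2))

# Observed months (assumes cumulative_days is sorted and a multiple of 30)
obs_month <- df$cumulative_days / 30
target <- 1:12

# Index of the last observed month that is <= each target month
idx <- findInterval(target, obs_month)
# A target before the first observation has nothing to roll from
idx[idx == 0] <- NA

rolled <- data.frame(ID = df$ID[1],
                     month = target,
                     rolling_percent = df$rolling_percent[idx])
```

With several subjects this lookup would have to be repeated per ID, which is what the `on = c('ID', 'month')` join handles in one step.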

Answer 1: (score: 1)

Here is a solution that joins the data.frame with a complete sequence of months:

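library(dplyr)
# convert cumulative days to a month index (assumes multiples of 30)
df$month <- df$cumulative_days / 30
# build the complete month sequence, then join the observed values onto it
result <- data.frame(ID = "Subject A", month = seq(1, max(df$month))) %>%
  left_join(df, by = c("ID", "month")) %>%
  select(-cumulative_days)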

If you want to apply the solution to several IDs, as in this fake dataset:

df <- data.frame( ID = "Subject A",
              cumulative_days = c(30, 60, 90, 180, 270, 360),
              rolling_percent = c(.8, .6, .6, .4, .3, .2))

df2 <- data.frame( ID = "Subject B",
              cumulative_days = c(30, 90, 120, 180, 270, 360),
              rolling_percent = c(.6, .4, .3, .2, .1, .6))

df<-rbind(df,df2)

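you can wrap the previous code in a function, split the large data frame by ID, apply the function to each piece, and finally bind all the results together. The code would look like this: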

# expand one subject's rows to a complete month sequence
buildDf <- function(df) {
  df$month <- df$cumulative_days / 30
  data.frame(ID = df$ID[1], month = seq(1, max(df$month))) %>%
    left_join(df, by = c("ID", "month")) %>%
    select(-cumulative_days)
}

listDf<-split(df,f=df$ID)
listDfFiltered<-lapply(listDf,buildDf)
result<-do.call('rbind',listDfFiltered)
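
As an alternative to the split / lapply / rbind pattern above, the same expansion can probably be written as a single grouped pipeline. This is a sketch that assumes the tidyr package is available (it is not used anywhere in the original answers); `complete()`, `full_seq()` and `fill()` are tidyr functions, and the name `result_tidy` is only illustrative.

```r
library(dplyr)
library(tidyr)

df <- rbind(
  data.frame(ID = "Subject A",
             cumulative_days = c(30, 60, 90, 180, 270, 360),
             rolling_percent = c(.8, .6, .6, .4, .3, .2)),
  data.frame(ID = "Subject B",
             cumulative_days = c(30, 90, 120, 180, 270, 360),
             rolling_percent = c(.6, .4, .3, .2, .1, .6))
)

result_tidy <- df %>%
  mutate(month = cumulative_days / 30) %>%   # assumes multiples of 30 days
  select(-cumulative_days) %>%
  group_by(ID) %>%
  complete(month = full_seq(month, 1)) %>%   # add the missing months per ID
  fill(rolling_percent) %>%                  # carry the last value forward
  ungroup()
```

Dropping the `fill()` line leaves the NA rows in place, which corresponds to the intermediate 'result' shape asked for in the question.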

Hope this helps.

Answer 2: (score: 1)

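We can do this with base R. Create the 'month' column by dividing by 30. Then use expand.grid to get a data.frame with every combination of 'ID' and the full range of 'month', and merge it with the original dataset, so that 'rolling_percent' is NA for the 'ID', 'month' combinations that are not found in 'df'.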

df$month <-df$cumulative_days/30
merge(expand.grid(ID = unique(df$ID), 
       month=Reduce(`:`, range(df$month))), df[-2], all.x=TRUE)
#          ID month rolling_percent
#1  Subject A     1             0.8
#2  Subject A     2             0.6
#3  Subject A     3             0.6
#4  Subject A     4              NA
#5  Subject A     5              NA
#6  Subject A     6             0.4
#7  Subject A     7              NA
#8  Subject A     8              NA
#9  Subject A     9             0.3
#10 Subject A    10              NA
#11 Subject A    11              NA
#12 Subject A    12             0.2
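
If the NAs should then be filled with the last observed value without any extra packages, a small helper along these lines might do. This is a sketch: `locf` is a hypothetical helper written for this example, not a base R function, and `merged` and `filled` are illustrative names.

```r
df <- data.frame(ID = "Subject A",
                 cumulative_days = c(30, 60, 90, 180, 270, 360),
                 rolling_percent = c(.8, .6, .6, .4, .3, .2))
df$month <- df$cumulative_days / 30

# Same expansion as in the answer above
merged <- merge(expand.grid(ID = unique(df$ID),
                            month = Reduce(`:`, range(df$month))),
                df[-2], all.x = TRUE)

# Hypothetical helper: last observation carried forward.
# cumsum(!is.na(x)) counts the non-NA values seen so far; 0 means none yet,
# which maps to the leading NA placed at position 1.
locf <- function(x) {
  seen <- cumsum(!is.na(x))
  c(NA, x[!is.na(x)])[seen + 1]
}

# Sort so that "previous" means the previous month, then fill within each ID
filled <- merged[order(merged$ID, merged$month), ]
filled$rolling_percent <- ave(filled$rolling_percent, filled$ID, FUN = locf)
```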