dplyr group_by and iterative loop calculation

Time: 2017-09-13 19:20:19

Tags: r dplyr

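I am trying to perform an iterative calculation on grouped data, where each value depends on the two previous elements within the group. As a toy example: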

library(data.table)  # data.table()
library(dplyr)       # %>%, group_by(), mutate()

set.seed(100)
df = data.table(ID = rep("A_index1", 9),
                Year = c(2001:2005, 2001:2004),
                Price = c(NA, NA, 10, NA, NA, 15, NA, 13, NA),
                Index = sample(seq(1, 3, by = 0.5), size = 9, replace = TRUE))

R> df
         ID Year Price Index
1: A_index1 2001    NA   1.5
2: A_index1 2002    NA   1.5
3: A_index1 2003    10   2.0
4: A_index1 2004    NA   1.0
5: A_index1 2005    NA   2.0
6: A_index1 2001    15   2.0
7: A_index1 2002    NA   3.0
8: A_index1 2003    13   1.5
9: A_index1 2004    NA   2.0

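The goal is to fill in the missing prices using the last available price, adjusted by the index. I have a loop that performs this calculation, and I am trying to vectorize it using dplyr.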

My logic is defined in the following loop:

df$Price_adj = df$Price
for (i in 2:nrow(df)) {
  if (is.na(df$Price[i])) {
    df$Price_adj[i] = round(df$Price_adj[i-1] * df$Index[i] / df$Index[i-1], 2)
  }
}

R> df
         ID Year Price Index Price_adj
1: A_index1 2001    NA   1.5        NA
2: A_index1 2002    NA   1.5        NA
3: A_index1 2003    10   2.0     10.00
4: A_index1 2004    NA   1.0      5.00
5: A_index1 2005    NA   2.0     10.00
6: A_index1 2001    15   2.0     15.00
7: A_index1 2002    NA   3.0     22.50
8: A_index1 2003    13   1.5     13.00
9: A_index1 2004    NA   2.0     17.33

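In my actual, much larger data set I will have to apply this function to multiple groups, and speed is a consideration. My attempt is below; I would appreciate being pointed in the right direction. I did consider Reduce, but I am not sure how it could incorporate the two previous elements in a group.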

foo = function(Price, Index){
  for (i in 2:nrow(df)) {
    if (is.na(df$Price[i])) {
      df$Price_adj[i] = df$Price_adj[i-1] * df$Index[i] / df$Index[i-1]
    }
  }
}

df %>% 
  group_by(ID) %>% 
  mutate(Price_adj = Price,
         Price_adj = foo(Price, Index))

1 Answer:

Answer 0: (score: 3)

One option with cumprod:

df %>% 
    # group data frame into chunks starting from non na price
    group_by(ID, g = cumsum(!is.na(Price))) %>% 
    # for each chunk multiply the first non na price with the cumprod of Index[i]/Index[i-1]
    mutate(Price_adj = round(first(Price) * cumprod(Index / lag(Index, default=first(Index))), 2)) %>% 
    ungroup() %>% select(-g)

# A tibble: 9 x 5
#        ID  Year Price Index Price_adj
#    <fctr> <int> <dbl> <dbl>     <dbl>
#1 A_index1  2001    NA   1.5        NA
#2 A_index1  2002    NA   1.5        NA
#3 A_index1  2003    10   2.0     10.00
#4 A_index1  2004    NA   1.0      5.00
#5 A_index1  2005    NA   2.0     10.00
#6 A_index1  2001    15   2.0     15.00
#7 A_index1  2002    NA   3.0     22.50
#8 A_index1  2003    13   1.5     13.00
#9 A_index1  2004    NA   2.0     17.33
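
  • group_by(ID, g = cumsum(!is.na(Price))) groups the data frame by ID and cumsum(!is.na(Price)); the latter splits the data frame into chunks, each of which starts with a non-NA price;

  • first(Price) * cumprod(Index / lag(Index, default=first(Index))) performs the iterative calculation. It is equivalent to the formula given in the question if you keep substituting Price_adj[i-1] with its expression in terms of Price_adj[i-2], and so on, until you reach Price_adj[1], which is first(Price).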
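
The substitution argument above can be checked numerically. The following is a minimal base-R sketch, not part of the original answer: it takes one chunk from the sample output (rows 3 to 5, starting at the non-NA price 10) and shows that the running product of step-to-step ratios collapses to Index[i] / Index[1]. The variable names are illustrative only.

```r
# One chunk from the sample data: rows 3-5 (the chunk starts at the non-NA price 10).
first_price <- 10
index <- c(2.0, 1.0, 2.0)

# Step-to-step ratios Index[i] / Index[i-1]; the first ratio is 1,
# mirroring lag(Index, default = first(Index)) in the dplyr code.
ratios <- c(1, index[-1] / index[-length(index)])

# Running product of ratios vs. the collapsed (telescoped) form.
via_cumprod <- first_price * cumprod(ratios)
via_first   <- first_price * index / index[1]

print(via_cumprod)                        # 10 5 10, as in rows 3-5 of the output
print(all.equal(via_cumprod, via_first))  # TRUE
```
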

Caveat: this may not be very efficient if you have many NA chunks, since each chunk becomes its own group.
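
One possible way around the many-small-groups cost is sketched below, assuming Index is never NA or zero. Within a chunk the adjusted price reduces to (last known price) * Index / (Index at the row of that price), so it is enough to locate the row of the last known price with cummax and index into the vectors directly. The names price_adj_vec and anchor are hypothetical and introduced only for this illustration; this is not the original answer's method and has not been benchmarked here.

```r
# Vectorised fill: for each row, find the position of the most recent
# non-NA price (the "anchor") and scale that price by Index / Index[anchor].
price_adj_vec <- function(price, index) {
  # Row number where price is known, 0 where it is NA; cummax carries the
  # latest known row number forward. Leading NAs keep anchor == 0.
  anchor <- cummax(ifelse(is.na(price), 0L, seq_along(price)))
  out <- rep(NA_real_, length(price))
  ok <- anchor > 0  # rows that have a known price at or before them
  out[ok] <- price[anchor[ok]] * index[ok] / index[anchor[ok]]
  out
}

# Sample data from the question, with the Index values shown in its output.
price <- c(NA, NA, 10, NA, NA, 15, NA, 13, NA)
index <- c(1.5, 1.5, 2.0, 1.0, 2.0, 2.0, 3.0, 1.5, 2.0)
result <- round(price_adj_vec(price, index), 2)
print(result)
```

Inside the pipeline this would be called as df %>% group_by(ID) %>% mutate(Price_adj = round(price_adj_vec(Price, Index), 2)), grouping by ID only.
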

If speed is the primary concern, you could write the function with the Rcpp package:

library(Rcpp)
cppFunction("
    NumericVector price_adj(NumericVector price, NumericVector index) {
        int n = price.size();
        NumericVector adjusted_price(n);
        adjusted_price[0] = price[0];
        for (int i = 1; i < n; i++) {
            if(NumericVector::is_na(price[i])) {
                adjusted_price[i] = adjusted_price[i-1] * index[i] / index[i-1];
            } else {
                adjusted_price[i] = price[i];
            }
        }
        return adjusted_price;
    }")

Now use the cpp function with dplyr as follows:

cpp_fun <- function() df %>% group_by(ID) %>% mutate(Price_adj = round(price_adj(Price, Index), 2))

cpp_fun()
# A tibble: 9 x 5
# Groups:   ID [1]
#        ID  Year Price Index Price_adj
#    <fctr> <int> <dbl> <dbl>     <dbl>
#1 A_index1  2001    NA   1.5        NA
#2 A_index1  2002    NA   1.5        NA
#3 A_index1  2003    10   2.0     10.00
#4 A_index1  2004    NA   1.0      5.00
#5 A_index1  2005    NA   2.0     10.00
#6 A_index1  2001    15   2.0     15.00
#7 A_index1  2002    NA   3.0     22.50
#8 A_index1  2003    13   1.5     13.00
#9 A_index1  2004    NA   2.0     17.33
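
For readers without a C++ toolchain, the same loop can be written in plain R as a function of two vectors that returns its result, which is what the foo attempt in the question was missing (it read from the global df and returned nothing). This is a sketch only: price_adj_r is a hypothetical name, and it will be slower than the Rcpp version on large groups. Like the C++ function, it leaves rounding to the caller, whereas the loop in the question rounds at every step.

```r
# Plain-R counterpart of the C++ loop: works on the vectors passed in
# (one group's Price and Index) and returns the adjusted prices.
price_adj_r <- function(price, index) {
  adjusted <- price
  # seq_along(price)[-1] is empty for vectors of length 0 or 1,
  # so the loop is skipped safely in those cases.
  for (i in seq_along(price)[-1]) {
    if (is.na(price[i])) {
      adjusted[i] <- adjusted[i - 1] * index[i] / index[i - 1]
    }
  }
  adjusted
}

# Sample data from the question, with the Index values shown in its output.
price <- c(NA, NA, 10, NA, NA, 15, NA, 13, NA)
index <- c(1.5, 1.5, 2.0, 1.0, 2.0, 2.0, 3.0, 1.5, 2.0)
loop_result <- round(price_adj_r(price, index), 2)
print(loop_result)
```

It can be dropped into the pipeline in place of price_adj: df %>% group_by(ID) %>% mutate(Price_adj = round(price_adj_r(Price, Index), 2)).
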

Benchmark

r_fun is defined as:

r_fun <- function() df %>% group_by(ID, g = cumsum(!is.na(Price))) %>% mutate(Price_adj = round(first(Price) * cumprod(Index / lag(Index, default=first(Index))), 2)) %>% ungroup() %>% select(-g)

Even on the small sample data there is already a difference:

microbenchmark::microbenchmark(r_fun(), cpp_fun())
#Unit: milliseconds
#      expr       min        lq      mean    median        uq        max neval
#   r_fun() 10.127839 10.500281 12.627831 11.148093 12.686662 101.466975   100
# cpp_fun()  3.191278  3.308758  3.738809  3.491495  3.937006   6.627019   100

Testing on a slightly larger data frame:

df <- bind_rows(rep(list(df), 10000))
#dim(df)
#[1] 90000     4

microbenchmark::microbenchmark(r_fun(), cpp_fun(), times = 10)
#Unit: milliseconds
#      expr        min         lq      mean    median        uq       max neval
#   r_fun() 842.706134 890.978575 904.70863 908.77042 921.89828 986.44576    10
# cpp_fun()   8.722794   8.888667  10.67781  10.86399  12.10647  13.68302    10

Identity test

identical(ungroup(r_fun()), ungroup(cpp_fun()))
# [1] TRUE