Conditionally using a secondary threshold level

Asked: 2018-12-18 10:38:39

Tags: r data.table

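I am trying to find a solution to the following problem, but my dataset is fairly large, so I want to avoid lots of loops and the like. I have two identifiers, var1 and var2, which are unique in combination with the date. I also have var3, a value between 0.5 (the threshold) and infinity. For each combination of var1 and var2, I want to compute the change in var3 from one date to the next, which I did with the line below, and it works like a charm: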

test = test[, test_change := var3 - shift(var3, type = "lag", n = 1), by = c("var1", "var2")]

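However, the result is incorrect when var3 is already above the 0.5 threshold on the date "2016-01-01". In that case I want to use the value on "2016-01-01" as the threshold until var3 drops to or below the 0.5 threshold. This is only needed when the start date is "2016-01-01". In addition, the change cannot be larger than the distance between the value and the threshold, so the part that falls below the threshold is left out. For example, in row 5, for (a, X), var3 drops from 1.5 to 0.6, but the temporary threshold is 1, so the change should equal -0.5.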
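The capping rule can be checked by hand on the (a, X) series. This is a minimal sketch, assuming the rule amounts to tracking how far var3 sits above whichever threshold is currently active and then differencing that excess. The vectors `thr`, `excess`, and `change` are illustrative names, not part of the original code.

```r
# var3 for group (a, X), in date order
var3 <- c(1, 1.5, 0.6, 0.5, 0.75)

# Assumed active threshold: the start-date value (1) until var3 first
# reaches 0.5, then the regular threshold of 0.5
thr <- c(1, 1, 1, 0.5, 0.5)

# Excess over the active threshold, never negative
excess <- pmax(var3 - thr, 0)

# Change in the excess; the first observation has no predecessor, so use 0
change <- c(0, diff(excess))
print(change)
```

Under these assumptions the third element is -0.5: the excess falls from 0.5 to 0, and the part of the drop below the temporary threshold of 1 is ignored.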

Data

test = data.table(Date = as.Date(c("2016-01-01", "2016-01-01", "2016-01-01","2016-01-3", "2016-01-05", "2016-01-05", "2016-01-06", "2016-01-06", "2016-01-07")), var1 = c("a", "a", "b","a", "a", "a", "b", "a", "a"), var2 = c("X", "Y","X", "X", "X", "Y", "X", "X", "X"), var3 = c(1,0.75,0.5,1.5, 0.6,1.2, 0.55, 0.50, 0.75))

> test
         Date var1 var2 var3
1: 2016-01-01    a    X 1.00
2: 2016-01-01    a    Y 0.75
3: 2016-01-01    b    X 0.50
4: 2016-01-03    a    X 1.50
5: 2016-01-05    a    X 0.60
6: 2016-01-05    a    Y 1.20
7: 2016-01-06    b    X 0.55
8: 2016-01-06    a    X 0.50
9: 2016-01-07    a    X 0.75

Expected result

test = data.table(Date = as.Date(c("2016-01-01", "2016-01-01", "2016-01-01","2016-01-3", "2016-01-05", "2016-01-05", "2016-01-06", "2016-01-06", "2016-01-07")), var1 = c("a", "a", "b","a", "a", "a", "b", "a", "a"), var2 = c("X", "Y","X", "X", "X", "Y", "X", "X", "X"), var3 = c(1,0.75,0.5,1.5, 0.6,1.2, 0.55, 0.50, 0.75), change_var3 = c(0,0,0,0.5,-0.5,0.45,0.05,0,0.25))

> test
         Date var1 var2 var3 change_var3
1: 2016-01-01    a    X 1.00        0.00
2: 2016-01-01    a    Y 0.75        0.00
3: 2016-01-01    b    X 0.50        0.00
4: 2016-01-03    a    X 1.50        0.50
5: 2016-01-05    a    X 0.60       -0.50
6: 2016-01-05    a    Y 1.20        0.45
7: 2016-01-06    b    X 0.55        0.05
8: 2016-01-06    a    X 0.50        0.00
9: 2016-01-07    a    X 0.75        0.25

Thank you very much for your help.

2 answers:

Answer 0: (score: 0)

I hope I have understood your situation correctly.

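The main change I made was to create the shifted variable as an additional column and then compute the change from that lag under the given conditions.
I assume that the first value of var3 in each group is the temporary threshold against which the rest of the group is compared, which is why it lines up with the NA value of the lag variable. I then updated the change column using your other conditions: if var3 is at or below the 0.5 threshold, or it is the first value in its group, the change is set to 0.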

library(data.table)
test = data.table(
  Date = as.Date(c("2016-01-01", "2016-01-01", "2016-01-01","2016-01-3", "2016-01-05", "2016-01-05", "2016-01-06", "2016-01-06", "2016-01-07")), 
  var1 = c("a", "a", "b","a", "a", "a", "b", "a", "a"), 
  var2 = c("X", "Y","X", "X", "X", "Y", "X", "X", "X"), 
  var3 = c(1,0.75,0.5,1.5, 0.6,1.2, 0.55, 0.50, 0.75), 
  change_var3 = c(0,0,0,0.5,-0.5,0.45,0.05,0,0.25))

test[, var3_lag := c(NA, var3[-.N]), by = c("var1", "var2")]
test[, test_change := ifelse(var3_lag > var3[is.na(var3_lag)], 
                              var3[is.na(var3_lag)] - var3_lag, 
                              var3 - var3_lag), 
     by = c("var1", "var2")]

test[is.na(var3_lag) | var3 <= 0.5, test_change := 0]

The result is:

> test
         Date var1 var2 var3 change_var3 var3_lag test_change
1: 2016-01-01    a    X 1.00        0.00       NA        0.00
2: 2016-01-01    a    Y 0.75        0.00       NA        0.00
3: 2016-01-01    b    X 0.50        0.00       NA        0.00
4: 2016-01-03    a    X 1.50        0.50     1.00        0.50
5: 2016-01-05    a    X 0.60       -0.50     1.50       -0.50
6: 2016-01-05    a    Y 1.20        0.45     0.75        0.45
7: 2016-01-06    b    X 0.55        0.05     0.50        0.05
8: 2016-01-06    a    X 0.50        0.00     0.60        0.00
9: 2016-01-07    a    X 0.75        0.25     0.50        0.25

Is this what you need?
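One way to confirm the match is to compare the computed column against the expected one programmatically. The sketch below is self-contained: it rebuilds the sample data and reruns the steps from this answer. It uses `all.equal` rather than `==` so that floating-point rounding in values such as 1.2 - 0.75 does not cause a spurious mismatch.

```r
library(data.table)

test <- data.table(
  Date = as.Date(c("2016-01-01", "2016-01-01", "2016-01-01", "2016-01-03",
                   "2016-01-05", "2016-01-05", "2016-01-06", "2016-01-06",
                   "2016-01-07")),
  var1 = c("a", "a", "b", "a", "a", "a", "b", "a", "a"),
  var2 = c("X", "Y", "X", "X", "X", "Y", "X", "X", "X"),
  var3 = c(1, 0.75, 0.5, 1.5, 0.6, 1.2, 0.55, 0.50, 0.75),
  change_var3 = c(0, 0, 0, 0.5, -0.5, 0.45, 0.05, 0, 0.25))

# Lag of var3 within each (var1, var2) group; NA marks the first row
test[, var3_lag := c(NA, var3[-.N]), by = c("var1", "var2")]

# Measure from the group's first value when the lag exceeds it,
# otherwise take the plain difference
test[, test_change := ifelse(var3_lag > var3[is.na(var3_lag)],
                              var3[is.na(var3_lag)] - var3_lag,
                              var3 - var3_lag),
     by = c("var1", "var2")]

# First rows and values at or below the 0.5 threshold get a change of 0
test[is.na(var3_lag) | var3 <= 0.5, test_change := 0]

# Compare with the expected column, allowing for numeric tolerance
matches <- isTRUE(all.equal(test$test_change, test$change_var3))
print(matches)
```

This only shows agreement on the nine sample rows; it says nothing about other patterns in the full dataset.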

Answer 1: (score: 0)

I was able to solve my own problem; hopefully this helps others who run into the same issue.

library(data.table)
test = data.table(Date = as.Date(c("2016-01-01", "2016-01-01", "2016-01-01","2016-01-3", "2016-01-05", "2016-01-05", "2016-01-06", "2016-01-06", "2016-01-07","2016-01-08")), var1 = c("a", "a", "b","a", "a", "a", "b", "a", "a", "a"), var2 = c("X", "Y","X", "X", "X", "Y", "X", "X", "X", "X"), var3 = c(1,0.75,0.5,1.5, 0.6,1.2, 0.55, 0.50, 0.75, 0.4))
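# Floor var3 at the regular 0.5 threshold
test[var3 <= 0.5, var3 := 0.5]
# Candidate temporary threshold: the value observed on the start date
test[, test_threshold := ifelse(Date == "2016-01-01", var3, NA)]
# Use the start-date value while var3 is above 0.5 and its lag is above 0.5 or missing, otherwise 0.5
test[, test := ifelse(var3 > 0.5 & (shift(var3, n = 1, type = "lag") > 0.5 | is.na(shift(var3, n = 1, type = "lag"))),
                      test_threshold[1], 0.5),
     by = c("var1", "var2")]
# Excess of var3 over the active threshold, floored at 0
test[, var5 := var3 - test]
test[var5 < 0, var5 := 0]
# Date-to-date change in that excess within each (var1, var2) group
test[, var5_change := var5 - shift(var5, n = 1, type = "lag"),
     by = c("var1", "var2")]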

> test
          Date var1 var2 var3 test_threshold test var5 var5_change
 1: 2016-01-01    a    X 1.00           1.00 1.00 0.00          NA
 2: 2016-01-01    a    Y 0.75           0.75 0.75 0.00          NA
 3: 2016-01-01    b    X 0.50           0.50 0.50 0.00          NA
 4: 2016-01-03    a    X 1.50             NA 1.00 0.50        0.50
 5: 2016-01-05    a    X 0.60             NA 1.00 0.00       -0.50
 6: 2016-01-05    a    Y 1.20             NA 0.75 0.45        0.45
 7: 2016-01-06    b    X 0.55             NA 0.50 0.05        0.05
 8: 2016-01-06    a    X 0.50             NA 0.50 0.00        0.00
 9: 2016-01-07    a    X 0.75             NA 0.50 0.25        0.25
10: 2016-01-08    a    X 0.50             NA 0.50 0.00       -0.25
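
One caveat with the condition above: it looks only at the current and previous value, so a series that falls to 0.5 and later stays above 0.5 for two consecutive dates would switch back to the start-date threshold. If the temporary threshold should be retired for good after var3 first touches 0.5, a cumulative flag is one option. This is a minimal sketch under that assumption, not the method used in this answer. The column names `active_thr`, `excess`, and `excess_change` are illustrative, and the rows for 2016-01-09 and 2016-01-10 are hypothetical values added only to exercise the case.

```r
library(data.table)

base_thr <- 0.5

# Group (a, X) from the example, plus two hypothetical later dates
dt <- data.table(
  Date = as.Date(c("2016-01-01", "2016-01-03", "2016-01-05", "2016-01-06",
                   "2016-01-07", "2016-01-08", "2016-01-09", "2016-01-10")),
  var1 = "a",
  var2 = "X",
  var3 = c(1, 1.5, 0.6, 0.5, 0.75, 0.4, 0.75, 0.8))

setorder(dt, var1, var2, Date)

# Floor var3 at the regular threshold
dt[, var3 := pmax(var3, base_thr)]

# Start-date value applies until var3 first reaches the regular threshold;
# from that row onward the regular threshold applies permanently
dt[, active_thr := {
  start_val <- if (Date[1] == as.Date("2016-01-01")) var3[1] else base_thr
  touched <- cumsum(var3 <= base_thr) > 0
  ifelse(touched, base_thr, start_val)
}, by = .(var1, var2)]

# Excess over the active threshold and its change within each group
dt[, excess := pmax(var3 - active_thr, 0)]
dt[, excess_change := excess - shift(excess), by = .(var1, var2)]

print(dt)
```

For the original six (a, X) rows this reproduces the var5_change values shown above. On the hypothetical last row the active threshold stays at 0.5 instead of reverting to 1.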