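I would like to compute a rolling count, per site, of the number of instances that exceed a threshold.
A simplified version of my data: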
Dates SiteID Value
1 2015-04-01 A 9.1
2 2015-04-02 A 8.8
3 2015-04-02 A 7.9
4 2015-04-03 A 9.2
5 2015-04-08 A 9.3
6 2015-04-11 A 8.9
7 2015-04-11 A 9.2
8 2015-04-13 A 9.1
9 2015-04-16 A 7.8
10 2015-04-01 B 9.1
11 2015-04-01 B 8.8
12 2015-04-04 B 9.9
13 2015-04-05 B 7.8
14 2015-04-06 B 9.8
15 2015-04-06 B 9.2
16 2015-04-07 B 9.1
17 2015-04-08 B 8.5
18 2015-04-15 B 9.1
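With a rolling period of 3 days and a threshold of 9 for 'Value', I am looking for a new column, 'Exceedances', that counts the number of times 'Value' was greater than 9 at a given 'SiteID' within the last 3 days. The result would look like this: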
Dates SiteID Value Exceedances
1 2015-04-01 A 9.1 1
2 2015-04-02 A 8.8 1
3 2015-04-02 A 7.9 1
4 2015-04-03 A 9.2 2
5 2015-04-08 A 9.3 1
6 2015-04-11 A 8.9 0
7 2015-04-11 A 9.2 1
8 2015-04-13 A 9.1 2
9 2015-04-16 A 7.8 0
10 2015-04-01 B 9.1 1
11 2015-04-01 B 8.8 1
12 2015-04-04 B 9.9 1
13 2015-04-05 B 7.8 1
14 2015-04-06 B 9.8 2
15 2015-04-06 B 9.2 3
16 2015-04-07 B 9.1 3
17 2015-04-08 B 8.5 3
18 2015-04-15 B 9.1 1
DT = structure(list(r = c(1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 1L,
2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L), Dates = structure(c(16526, 16527,
16527, 16528, 16533, 16536, 16536, 16538, 16541, 16526, 16526,
16529, 16530, 16531, 16531, 16532, 16533, 16540), class = "Date"),
SiteID = c("A", "A", "A", "A", "A", "A", "A", "A", "A", "B",
"B", "B", "B", "B", "B", "B", "B", "B"), Value = c(9.1, 8.8,
7.9, 9.2, 9.3, 8.9, 9.2, 9.1, 7.8, 9.1, 8.8, 9.9, 7.8, 9.8,
9.2, 9.1, 8.5, 9.1), Exceedances = c(1L, 1L, 1L, 2L, 1L,
0L, 1L, 2L, 0L, 1L, 1L, 1L, 1L, 2L, 3L, 3L, 3L, 1L)), .Names = c("r",
"Dates", "SiteID", "Value", "Exceedances"), row.names = c(NA,
-18L), class = "data.frame")
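I have seen similar questions that use data.table and dplyr, but none of them deal with counting threshold exceedances.
Ultimately this will be applied to a very large data set, so the fastest method would be appreciated. In case it affects the advice: I will also apply this over a rolling year rather than the 3-day window in the example above, and the data set will contain NA values.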
Answer 0 (score: 3)
Since the row number seems to matter, we can add it as a column:
library(data.table)
setDT(DT)
DT[, r := rowid(SiteID)]
setcolorder(DT, c("r", setdiff(names(DT), "r")))
Then you can do a non-equi join to count the rows that meet the criteria:
DT[, v :=
  DT[.(SiteID = SiteID, rtop = r, d0 = Dates - 3, d1 = Dates),
     on = .(SiteID, r <= rtop, Dates > d0, Dates <= d1),
     sum(Value > 9), by = .EACHI]$V1
]
r Dates SiteID Value Exceedances v
1: 1 2015-04-01 A 9.1 1 1
2: 2 2015-04-02 A 8.8 1 1
3: 3 2015-04-02 A 7.9 1 1
4: 4 2015-04-03 A 9.2 2 2
5: 5 2015-04-08 A 9.3 1 1
6: 6 2015-04-11 A 8.9 0 0
7: 7 2015-04-11 A 9.2 1 1
8: 8 2015-04-13 A 9.1 2 2
9: 9 2015-04-16 A 7.8 0 0
10: 1 2015-04-01 B 9.1 1 1
11: 2 2015-04-01 B 8.8 1 1
12: 3 2015-04-04 B 9.9 1 1
13: 4 2015-04-05 B 7.8 1 1
14: 5 2015-04-06 B 9.8 2 2
15: 6 2015-04-06 B 9.2 3 3
16: 7 2015-04-07 B 9.1 3 3
17: 8 2015-04-08 B 8.5 3 3
18: 9 2015-04-15 B 9.1 1 1
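There are a couple of potential issues here:
- To count distinct dates rather than rows, use uniqueN(x.Dates[Value > 9]) instead of sum(Value > 9).
- If row order within a date does not matter, the condition on r and rtop can be dropped.
For how it works, see the vignettes and my answer to a similar question here.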
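As a cross-check on the join above, the window condition can be spelled out row by row in base R. This is only a minimal sketch, not the method used in this answer: it is quadratic in the number of rows and meant for validating results on small data. The function name rolling_exceed_ref and its arguments are hypothetical, and na.rm = TRUE is an assumption about how NA values should be treated (they are simply not counted).

```r
# Hypothetical reference implementation: for each row, count the rows of the
# same site that are at or before it in row order, fall in the window
# (date - days, date], and have a value above the threshold.
rolling_exceed_ref <- function(dates, site, value, threshold = 9, days = 3) {
  vapply(seq_along(dates), function(i) {
    j <- seq_len(i)                       # rows up to and including row i
    in_window <- site[j] == site[i] &
      dates[j] > dates[i] - days &
      dates[j] <= dates[i]
    # na.rm = TRUE is an assumption: NA values are not counted
    sum(value[j][in_window] > threshold, na.rm = TRUE)
  }, integer(1))
}

# First five rows of site A from the question
ex <- data.frame(
  Dates  = as.Date(c("2015-04-01", "2015-04-02", "2015-04-02",
                     "2015-04-03", "2015-04-08")),
  SiteID = "A",
  Value  = c(9.1, 8.8, 7.9, 9.2, 9.3)
)
ex$check <- rolling_exceed_ref(ex$Dates, ex$SiteID, ex$Value)
ex$check
```

On these five rows the result should agree with the Exceedances column shown in the question (1, 1, 1, 2, 1).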
Answer 1 (score: 1)
We can express the problem as a complex left join using sqldf:
library(sqldf)
sqldf("select a.*, sum(b.Value > 9) exceed
from DT a
left join DT b on a.SiteID = b.SiteID and
b.Dates > a.Dates - 3 and
b.rowid <= a.rowid
group by a.rowid")
giving:
Dates SiteID Value exceed
1 2015-04-01 A 9.1 1
2 2015-04-02 A 8.8 1
3 2015-04-02 A 7.9 1
4 2015-04-03 A 9.2 2
5 2015-04-08 A 9.3 1
6 2015-04-11 A 8.9 0
7 2015-04-11 A 9.2 1
8 2015-04-13 A 9.1 2
9 2015-04-16 A 7.8 0
10 2015-04-01 B 9.1 1
11 2015-04-01 B 8.8 1
12 2015-04-04 B 9.9 1
13 2015-04-05 B 7.8 1
14 2015-04-06 B 9.8 2
15 2015-04-06 B 9.2 3
16 2015-04-07 B 9.1 3
17 2015-04-08 B 8.5 3
18 2015-04-15 B 9.1 1
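Answer 2 (score: 1)
Here is an answer using data.table. It is simple and probably fast. It uses shift to get the Value of the previous two rows, replaces the resulting NA with 0 (the first two rows in each group), gives 1 for values > 9 and 0 otherwise, and then adds them up (including the 1 or 0 for the current row).
library(data.table)
setDT(DT)
DT[, newCol := ifelse(shift(Value, n=1, fill=0)>9, 1, 0) + ifelse(shift(Value, n=2, fill=0)>9, 1, 0) + ifelse(Value>9, 1, 0), by=SiteID]
Following Frank's comment:
DT[, newCol := (shift(Value, n=1, fill=0)>9) + (shift(Value, n=2, fill=0)>9) + (Value>9), by=SiteID]
also works.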
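To make the shift step concrete, here is a minimal sketch on a toy vector (the vector v is made up for illustration and is not taken from the question). Note that shift moves by rows, so the sum covers the current row and the two rows before it, whatever their dates are.

```r
library(data.table)

v <- c(9.1, 8.8, 9.2, 9.3)        # toy values, not from the question

lag1 <- shift(v, n = 1, fill = 0) # previous row's value, 0 where there is none
lag2 <- shift(v, n = 2, fill = 0) # value two rows back, 0 where there is none

# Logical comparisons are coerced to 0/1 when added together
(lag1 > 9) + (lag2 > 9) + (v > 9)
```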
Answer 3 (score: 1)
Taking into account that the "Dates" column is ordered, this seems to work:
thres = 9; n = 3
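do.call(rbind, lapply(split(DT, DT$SiteID),
    function(d) {
        # running count of exceedances (strictly greater than the threshold)
        cs = cumsum(d$Value > thres)
        # index of the last row that falls before the start of the window
        i = findInterval(d$Dates - (n - 1), d$Dates, left.open = TRUE)
        cbind(d, exceed = cs - c(rep_len(0, sum(!i)), cs[i]))
    }))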
# r Dates SiteID Value Exceedances exceed
#A.1 1 2015-04-01 A 9.1 1 1
#A.2 2 2015-04-02 A 8.8 1 1
#A.3 3 2015-04-02 A 7.9 1 1
#A.4 4 2015-04-03 A 9.2 2 2
#A.5 5 2015-04-08 A 9.3 1 1
#A.6 6 2015-04-11 A 8.9 0 0
#A.7 7 2015-04-11 A 9.2 1 1
#A.8 8 2015-04-13 A 9.1 2 2
#A.9 9 2015-04-16 A 7.8 0 0
#B.10 1 2015-04-01 B 9.1 1 1
#B.11 2 2015-04-01 B 8.8 1 1
#B.12 3 2015-04-04 B 9.9 1 1
#B.13 4 2015-04-05 B 7.8 1 1
#B.14 5 2015-04-06 B 9.8 2 2
#B.15 6 2015-04-06 B 9.2 3 3
#B.16 7 2015-04-07 B 9.1 3 3
#B.17 8 2015-04-08 B 8.5 3 3
#B.18 9 2015-04-15 B 9.1 1 1
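The cumsum/findInterval idea can be traced on a single site. This is a minimal sketch using the first five rows of site A; the variable names dates, value and dropped are chosen for illustration, and the comparison is written as strictly greater than 9, as the question states.

```r
dates <- as.Date(c("2015-04-01", "2015-04-02", "2015-04-02",
                   "2015-04-03", "2015-04-08"))
value <- c(9.1, 8.8, 7.9, 9.2, 9.3)
n <- 3

cs <- cumsum(value > 9)     # running count of exceedances: 1 1 1 2 3

# With left.open = TRUE, findInterval() returns how many dates are strictly
# earlier than the window start, i.e. the index of the last row that has
# dropped out of the window (0 if none has).
i <- findInterval(dates - (n - 1), dates, left.open = TRUE)

# Exceedances accumulated up to the dropped row; cs[i] skips the zeros in i,
# so they are padded back at the front.
dropped <- c(rep_len(0, sum(i == 0)), cs[i])

cs - dropped
```

Subtracting the running count at the last dropped row from the running count at the current row leaves only the exceedances inside the window. Each site then needs a single pass over its rows plus a search in the sorted dates.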