Question

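I have a dataset of parking tickets with their start/end times and information on where they were purchased (group). I need to do a time series analysis to forecast when and where tickets will be purchased in the future. For that, I need to convert the data into a time series format that contains the number of valid tickets at a given point in time.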

A sample of my data:

library(lubridate)
timeseries <- data.frame(start = c("2016-12-31 20:42:00",
                                   "2016-12-31 21:41:00",
                                   "2016-12-31 21:15:00",
                                   "2016-12-31 17:19:00",
                                   "2016-12-31 21:47:00",
                                   "2016-12-31 16:58:00"),
                         end = c("2016-12-31 23:07:00",
                                 "2016-12-31 23:07:00",
                                 "2016-12-31 23:08:00",
                                 "2016-12-31 23:09:00",
                                 "2016-12-31 23:11:00",
                                 "2016-12-31 23:11:00"),
                         group = c(1,2,1,2,1,2),
                         stringsAsFactors = FALSE)
timeseries$start <- as.POSIXlt(timeseries$start)
timeseries$end <- as.POSIXlt(timeseries$end)
timeseries$interval <- interval(timeseries$start, timeseries$end, tzone="UTC")

An example of the time slots in which I would like to have the information aggregated (by group):

summary_hours <- data.frame(timeStart = c("2016-12-31 16:00",
                                          "2016-12-31 17:00",
                                          "2016-12-31 18:00",
                                          "2016-12-31 19:00",
                                          "2016-12-31 20:00",
                                          "2016-12-31 21:00",
                                          "2016-12-31 22:00",
                                          "2016-12-31 23:00"),
                            timeEnd = c("2016-12-31 17:00",
                                        "2016-12-31 18:00",
                                        "2016-12-31 19:00",
                                        "2016-12-31 20:00",
                                        "2016-12-31 21:00",
                                        "2016-12-31 22:00",
                                        "2016-12-31 23:00",
                                        "2017-01-01 00:00"))
summary_hours$timeStart <- as.POSIXlt(summary_hours$timeStart)
summary_hours$timeEnd <- as.POSIXlt(summary_hours$timeEnd)
summary_hours$interval <- interval(summary_hours$timeStart, summary_hours$timeEnd, tzone="UTC")

My current approach seems to be very inefficient when the dataset spans two years. Do you have any suggestions for a nice library that can do this kind of magic automatically?

Answer 1

In his comments here and here, the OP has changed the goal of the question. The request is now to aggregate the "minutes of active tickets" by time intervals of one hour.

IMHO, this requires a completely different approach, which justifies posting a separate answer.

To check which tickets are active within each one-hour interval, the foverlaps() function from the data.table package can be used:

library(data.table)
# IMPORTANT for reproducibility in different timezones
Sys.setenv(TZ = "UTC")
# convert timestamps from character to POSIXct
cols <- c("start", "end")
setDT(timeseries)[, (cols) := lapply(.SD, fasttime::fastPOSIXct), .SDcols = cols]

# create sequence of intervals of one hour covering all given times
hours_seq <- timeseries[, {
  tmp <- seq(lubridate::floor_date(min(start, end), "hour"),
             lubridate::ceiling_date(max(start, end), "hour"), 
             by = "1 hour")
  .(start = head(tmp, -1L), end = tail(tmp, -1L))
  }]
hours_seq

                 start                 end
1: 2016-12-31 16:00:00 2016-12-31 17:00:00
2: 2016-12-31 17:00:00 2016-12-31 18:00:00
3: 2016-12-31 18:00:00 2016-12-31 19:00:00
4: 2016-12-31 19:00:00 2016-12-31 20:00:00
5: 2016-12-31 20:00:00 2016-12-31 21:00:00
6: 2016-12-31 21:00:00 2016-12-31 22:00:00
7: 2016-12-31 22:00:00 2016-12-31 23:00:00
8: 2016-12-31 23:00:00 2017-01-01 00:00:00

# split up given ticket intervals in hour pieces 
foverlaps(hours_seq, setkey(timeseries, start, end), nomatch = 0L)[
  # compute active minutes and aggregate
  , .(cnt_active_tickets = .N, 
      sum_active_minutes = sum(as.integer(
        difftime(pmin(end, i.end), pmax(start, i.start), units = "mins")))), 
    keyby = .(group, interval_start = i.start, interval_end = i.end)]

    group      interval_start        interval_end cnt_active_tickets sum_active_minutes
 1:     1 2016-12-31 20:00:00 2016-12-31 21:00:00                  1                 18
 2:     1 2016-12-31 21:00:00 2016-12-31 22:00:00                  3                118
 3:     1 2016-12-31 22:00:00 2016-12-31 23:00:00                  3                180
 4:     1 2016-12-31 23:00:00 2017-01-01 00:00:00                  3                 26
 5:     2 2016-12-31 16:00:00 2016-12-31 17:00:00                  1                  2
 6:     2 2016-12-31 17:00:00 2016-12-31 18:00:00                  2                101
 7:     2 2016-12-31 18:00:00 2016-12-31 19:00:00                  2                120
 8:     2 2016-12-31 19:00:00 2016-12-31 20:00:00                  2                120
 9:     2 2016-12-31 20:00:00 2016-12-31 21:00:00                  2                120
10:     2 2016-12-31 21:00:00 2016-12-31 22:00:00                  3                139
11:     2 2016-12-31 22:00:00 2016-12-31 23:00:00                  3                180
12:     2 2016-12-31 23:00:00 2017-01-01 00:00:00                  3                 27

Note that this approach also takes "short-term customers" into account, i.e., tickets that are active for less than one hour and that start after a full hour and end before the next full hour.

Wide-format output

If the values for each group should be shown side by side, dcast() can be used to reshape the data from long to wide format:

foverlaps(hours_seq, setkey(timeseries, start, end), nomatch = 0L)[
  , active_minutes := as.integer(
    difftime(pmin(end, i.end), pmax(start, i.start), units = "mins"))][
      , dcast(.SD, i.start + i.end ~ paste0("group", group), sum,
              value.var = "active_minutes")]
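
As a sanity check on the wide result, here is a minimal base R sketch (no data.table required) that recomputes the active minutes per hourly interval and group for the six sample tickets. It is not part of the approach above; the object names (tickets, breaks, overlap_min, wide) are made up for illustration, and the hourly breaks are hard-coded for the sample data. It relies on the same arithmetic as the foverlaps() code: the overlap is min(ticket end, interval end) minus max(ticket start, interval start), floored at zero.

```r
# Base R cross-check of the active minutes per hourly interval and group.
# Data: the six sample tickets from the question (names are illustrative).
tz <- "UTC"
tickets <- data.frame(
  start = as.POSIXct(c("2016-12-31 20:42:00", "2016-12-31 21:41:00",
                       "2016-12-31 21:15:00", "2016-12-31 17:19:00",
                       "2016-12-31 21:47:00", "2016-12-31 16:58:00"), tz = tz),
  end   = as.POSIXct(c("2016-12-31 23:07:00", "2016-12-31 23:07:00",
                       "2016-12-31 23:08:00", "2016-12-31 23:09:00",
                       "2016-12-31 23:11:00", "2016-12-31 23:11:00"), tz = tz),
  group = c(1, 2, 1, 2, 1, 2)
)

# hourly breaks covering the sample data (hard-coded assumption)
breaks <- seq(as.POSIXct("2016-12-31 16:00:00", tz = tz),
              as.POSIXct("2017-01-01 00:00:00", tz = tz), by = "1 hour")
int_start <- head(breaks, -1L)
int_end   <- tail(breaks, -1L)

# work on plain seconds so that outer() does not have to deal with classes
s  <- as.numeric(tickets$start)
e  <- as.numeric(tickets$end)
is <- as.numeric(int_start)
ie <- as.numeric(int_end)

# tickets in rows, intervals in columns:
# overlap = min(ticket end, interval end) - max(ticket start, interval start),
# floored at 0 for intervals the ticket does not touch, converted to minutes
overlap_min <- pmax(outer(e, ie, pmin) - outer(s, is, pmax), 0) / 60

# sum the ticket rows by group, then transpose: intervals in rows, groups in columns
wide <- t(rowsum(overlap_min, tickets$group))
dimnames(wide) <- list(format(int_start, "%H:%M", tz = tz),
                       paste0("group", sort(unique(tickets$group))))
print(wide)
```

The non-zero entries of wide should match the sum_active_minutes values listed above. Note that outer() builds a full tickets-by-intervals matrix, so this is only meant for checking small samples, not for two years of data.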

Answer 2

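The OP has asked for the number of active tickets at given points in time.

This can be achieved with a non-equi join of the start and end dates against a consecutive sequence of fixed hourly time points: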

library(data.table)
# IMPORTANT for reproducibility in different timezones
Sys.setenv(TZ = "UTC")

# convert timestamps from character to POSIXct
cols <- c("start", "end")
setDT(timeseries)[, (cols) := lapply(.SD, fasttime::fastPOSIXct), .SDcols = cols]
# add id to each row (required to count the active tickets later)
timeseries[, rn := .I]
# print data for illustration
timeseries[order(group, start, end)]

                 start                 end group rn
1: 2016-12-31 20:42:00 2016-12-31 23:07:00     1  1
2: 2016-12-31 21:15:00 2016-12-31 23:08:00     1  3
3: 2016-12-31 21:47:00 2016-12-31 23:11:00     1  5
4: 2016-12-31 16:58:00 2016-12-31 23:11:00     2  6
5: 2016-12-31 17:19:00 2016-12-31 23:09:00     2  4
6: 2016-12-31 21:41:00 2016-12-31 23:07:00     2  2

# create sequence of hourly timepoints
hours_seq <- timeseries[, seq(lubridate::floor_date(min(start, end), "hour"),
                              lubridate::ceiling_date(max(start, end), "hour"), 
                              by = "1 hour")]
hours_seq

[1] "2016-12-31 16:00:00 UTC" "2016-12-31 17:00:00 UTC" "2016-12-31 18:00:00 UTC" "2016-12-31 19:00:00 UTC"
[5] "2016-12-31 20:00:00 UTC" "2016-12-31 21:00:00 UTC" "2016-12-31 22:00:00 UTC" "2016-12-31 23:00:00 UTC"
[9] "2017-01-01 00:00:00 UTC"

# non-equi join
timeseries[.(hr = hours_seq), on = .(start <= hr, end > hr), nomatch = 0L,
           allow.cartesian = TRUE][
             # count number of active tickets at timepoint and by group
             , .(n.active.tickets = uniqueN(rn)), keyby = .(group, timepoint = start)]

    group           timepoint n.active.tickets
 1:     1 2016-12-31 21:00:00                1
 2:     1 2016-12-31 22:00:00                3
 3:     1 2016-12-31 23:00:00                3
 4:     2 2016-12-31 17:00:00                1
 5:     2 2016-12-31 18:00:00                2
 6:     2 2016-12-31 19:00:00                2
 7:     2 2016-12-31 20:00:00                2
 8:     2 2016-12-31 21:00:00                2
 9:     2 2016-12-31 22:00:00                3
10:     2 2016-12-31 23:00:00                3
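
The join condition can also be checked by brute force. The following is a minimal base R sketch, assuming the six sample tickets and hard-coded hourly time points; the object names (tickets, timepoints, active, counts) are made up for illustration. A ticket counts as active at time point t when start <= t and end > t, which is the same condition as in the non-equi join above.

```r
# Base R cross-check: number of active tickets per group at each full hour.
tz <- "UTC"
tickets <- data.frame(
  start = as.POSIXct(c("2016-12-31 20:42:00", "2016-12-31 21:41:00",
                       "2016-12-31 21:15:00", "2016-12-31 17:19:00",
                       "2016-12-31 21:47:00", "2016-12-31 16:58:00"), tz = tz),
  end   = as.POSIXct(c("2016-12-31 23:07:00", "2016-12-31 23:07:00",
                       "2016-12-31 23:08:00", "2016-12-31 23:09:00",
                       "2016-12-31 23:11:00", "2016-12-31 23:11:00"), tz = tz),
  group = c(1, 2, 1, 2, 1, 2)
)

# hourly time points covering the sample data (hard-coded assumption)
timepoints <- seq(as.POSIXct("2016-12-31 16:00:00", tz = tz),
                  as.POSIXct("2017-01-01 00:00:00", tz = tz), by = "1 hour")

# compare plain seconds; tickets in rows, time points in columns
s  <- as.numeric(tickets$start)
e  <- as.numeric(tickets$end)
tp <- as.numeric(timepoints)
active <- outer(s, tp, "<=") & outer(e, tp, ">")

# count the active tickets per group (rows) and time point (columns)
counts <- rowsum(active * 1L, tickets$group)
dimnames(counts) <- list(paste0("group", sort(unique(tickets$group))),
                         format(timepoints, "%H:%M", tz = tz))
print(counts)
```

Unlike the join result, this matrix also contains the zero counts, i.e., the time points at which a group has no active ticket. That may be convenient if the forecast needs a regular series without gaps.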

Aggregating time series by hour