Run-length encoding and

Time: 2017-12-30 04:23:03

Tags: r dplyr data.table

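I'm still getting used to data.table. My goal is to use rle() or rleid() while grouping by multiple variables, since rle() is not a typical summary statistic.

In the test dataset below, my goal is to count consecutive repeated records where a unique bike (bike_id) stays at the same location (address), grouped by date and bike_id.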

Some test data:

> dat
                   time bike_id          address
 1: 2017-11-22 15:45:34       1        Waters Rd
 2: 2017-11-22 15:50:16       1        Waters Rd
 3: 2017-11-22 16:00:03       1   Washington Ave
 4: 2017-11-22 16:10:03       1   Washington Ave
 5: 2017-11-22 16:20:02       1   Washington Ave
 6: 2017-11-22 16:30:02       2       Shady Lane
 7: 2017-11-22 16:40:03       2     Comstock Ave
 8: 2017-11-22 16:50:02       2     Comstock Ave
 9: 2017-11-22 17:00:02       2     Comstock Ave
10: 2017-11-22 17:10:02       2     Comstock Ave
11: 2017-11-22 17:20:03       3   Scranton Drive
12: 2017-11-22 17:30:03       3   Scranton Drive
13: 2017-11-22 17:40:03       3   Scranton Drive
14: 2017-11-22 17:50:03       3       Shady Lane
15: 2017-11-22 18:00:04       3   Scranton Drive
16: 2017-11-23 18:10:03       1       Shady Lane
17: 2017-11-23 18:20:03       1       Shady Lane
18: 2017-11-23 18:30:02       1       Shady Lane
19: 2017-11-23 18:40:03       1       Shady Lane
20: 2017-11-23 18:50:03       1       Shady Lane
21: 2017-11-23 19:00:03       2      Lovers Lane
22: 2017-11-23 19:10:02       2 Mulholland Drive
23: 2017-11-23 19:20:03       2 Mulholland Drive
24: 2017-11-23 19:30:02       2 Mulholland Drive
25: 2017-11-23 19:40:03       2 Mulholland Drive
                   time bike_id          address

I know that rle(dat$address) would produce the third column of the desired output below, but I'm not sure how to use rle() with grouping in data.table.

Any suggestions would be helpful!

Desired output:

> output
         date bike_id rle
1  2017-11-22       1   2
2  2017-11-22       1   3
3  2017-11-22       2   1
4  2017-11-22       2   4
5  2017-11-22       3   3
6  2017-11-22       3   1
7  2017-11-22       3   1
8  2017-11-23       1   5
9  2017-11-23       2   1
10 2017-11-23       2   4

Here is the sample data:

> dput(dat)
structure(list(time = structure(c(1511383534.43394, 1511383816.49785, 
1511384403.94561, 1511385003.17654, 1511385602.47887, 1511386202.99895, 
1511386803.18361, 1511387402.98233, 1511388002.69461, 1511388602.5818, 
1511389203.52712, 1511389803.652, 1511390403.26619, 1511391003.79218, 
1511391604.30061, 1511478603.55103, 1511479203.60366, 1511479802.97132, 
1511480403.45374, 1511481003.12783, 1511481603.34055, 1511482202.62777, 
1511482803.66405, 1511483402.83378, 1511484003.46605), tzone = "", class = c("POSIXct", 
"POSIXt")), bike_id = c(1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3, 
3, 3, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2), address = c("Waters Rd", 
"Waters Rd", "Washington Ave", "Washington Ave", "Washington Ave", 
"Shady Lane", "Comstock Ave", "Comstock Ave", "Comstock Ave", 
"Comstock Ave", "Scranton Drive", "Scranton Drive", "Scranton Drive", 
"Shady Lane", "Scranton Drive", "Shady Lane", "Shady Lane", "Shady Lane", 
"Shady Lane", "Shady Lane", "Lovers Lane", "Mulholland Drive", 
"Mulholland Drive", "Mulholland Drive", "Mulholland Drive")), .Names = c("time", 
"bike_id", "address"), class = c("data.table", "data.frame"), row.names = c(NA, 
-25L), .internal.selfref = <pointer: 0x10300d178>)

Edit:

The only case where the code in the answer below produces an incorrect result:

> dput(dat)
structure(list(bike_id = c(1, 1, 1, 1, 1, 1), lon = c(-76.968, 
-76.968, -76.968, -72.141, -72.141, -72.141), lat = c(38.924, 
38.924, 38.924, -39.219, -39.219, -39.219), time = structure(c(1511383534.49273, 
1511383816.52327, 1511384403.97359, 1511385003.20305, 1511385602.50507, 
1511299803.02598), tzone = "", class = c("POSIXct", "POSIXt"))), .Names = c("bike_id", 
"lon", "lat", "time"), row.names = c(NA, -6L), class = c("data.table", 
"data.frame"), .internal.selfref = <pointer: 0x10300d178>)

> dat
   bike_id     lon     lat                time
1:       1 -76.968  38.924 2017-11-22 15:45:34
2:       1 -76.968  38.924 2017-11-22 15:50:16
3:       1 -76.968  38.924 2017-11-22 16:00:03
4:       1 -72.141 -39.219 2017-11-22 16:10:03
5:       1 -72.141 -39.219 2017-11-22 16:20:02
6:       1 -72.141 -39.219 2017-11-21 16:30:03

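> dat[, .(date = as.Date(time)[1], n = .N), .(bike_id, grp = rleid(lat, lon))][, grp := NULL][]

produces:

   bike_id       date n
1:       1 2017-11-22 3
2:       1 2017-11-22 3

Expected: the 2017-11-21 record should be counted separately from the 2017-11-22 records, since the counts should also be grouped by date.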

1 answer:

Answer 0 (score: 6)

We can use rleid from data.table:

dat[, .(date = as.Date(time)[1], n = .N), .(bike_id, grp = rleid(address))][, grp := NULL][]

If there is more than one 'date' per group (as in the second dataset), the code above keeps only the first 'date' (because of the [1]). If we want every distinct 'date' instead, use unique:

dat[, .(date = unique(as.Date(time)), n = .N), .(bike_id, grp = rleid(lon, lat))]
#   bike_id grp       date n
#1:       1   1 2017-11-22 3
#2:       1   2 2017-11-22 3
#3:       1   2 2017-11-21 3

However, this returns multiple rows per group. If we need only a single row per group, create a list column (which keeps the class) or paste the elements:

dat[, .(date = list(unique(as.Date(time))), n = .N), .(bike_id, grp = rleid(lon, lat))]
#   bike_id grp                  date n
#1:       1   1            2017-11-22 3
#2:       1   2 2017-11-22,2017-11-21 3
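The paste alternative mentioned above is not shown in the answer. A minimal sketch of what it could look like follows. It assumes a toy copy of the second dataset from the question, with the times declared in UTC (an assumption made here so the date conversion does not depend on the local time zone).

```r
library(data.table)

# Toy copy of the question's second dataset; tz = "UTC" is an assumption
# made for this sketch so that as.Date() is unambiguous.
dat <- data.table(
  bike_id = rep(1, 6),
  lon = c(-76.968, -76.968, -76.968, -72.141, -72.141, -72.141),
  lat = c(38.924, 38.924, 38.924, -39.219, -39.219, -39.219),
  time = as.POSIXct(
    c("2017-11-22 15:45:34", "2017-11-22 15:50:16", "2017-11-22 16:00:03",
      "2017-11-22 16:10:03", "2017-11-22 16:20:02", "2017-11-21 16:30:03"),
    tz = "UTC"
  )
)

# One row per run of identical (lon, lat); the distinct dates in each run
# are collapsed into a single comma-separated string.
res <- dat[, .(date = paste(unique(as.Date(time, tz = "UTC")), collapse = ","),
               n = .N),
           by = .(bike_id, grp = rleid(lon, lat))]
print(res)
```

Unlike the list column, the pasted column is plain character, so the dates would have to be parsed again before doing any date arithmetic on them.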

Update

Based on the OP's updated post with the expected output (from the second dataset), we also need to use 'date' as a grouping variable.
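The code for this update is not preserved here, so the following is only a sketch of one way to add the date to the grouping, not necessarily the answerer's exact code. It again assumes a toy copy of the second dataset with times declared in UTC.

```r
library(data.table)

# Toy copy of the question's second dataset; tz = "UTC" is an assumption
# made for this sketch so that as.Date() is unambiguous.
dat <- data.table(
  bike_id = rep(1, 6),
  lon = c(-76.968, -76.968, -76.968, -72.141, -72.141, -72.141),
  lat = c(38.924, 38.924, 38.924, -39.219, -39.219, -39.219),
  time = as.POSIXct(
    c("2017-11-22 15:45:34", "2017-11-22 15:50:16", "2017-11-22 16:00:03",
      "2017-11-22 16:10:03", "2017-11-22 16:20:02", "2017-11-21 16:30:03"),
    tz = "UTC"
  )
)

# Group by bike, calendar date, and the run id of the location, then count
# the rows in each group. The run id is only a helper, so it is dropped.
res <- dat[, .(n = .N),
           by = .(bike_id, date = as.Date(time, tz = "UTC"), grp = rleid(lon, lat))]
res[, grp := NULL]
print(res)  # n is 3, 2, 1: the 2017-11-21 record forms its own group
```

For the first dataset, the same pattern with rleid(address) in place of rleid(lon, lat) should give one count per bike, date, and run of identical addresses.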