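I'm still getting used to what data.table can do. My goal is to use rle() or rleid() while grouping by multiple variables; rle() isn't a typical summary statistic.

In the test dataset below, I want to count consecutive repeated records in which a given bike (bike_id) stays at the same address, grouped by date and bike_id.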
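For readers unfamiliar with the two functions, here is a minimal sketch of how they differ. The vector x is a hypothetical toy example, not part of the question's data.

```r
library(data.table)

# Hypothetical toy vector, not taken from the question's data
x <- c("a", "a", "b", "b", "b", "a")

# Base R: rle() returns the length and the value of each run
runs <- rle(x)
runs$lengths
runs$values

# data.table: rleid() returns one run id per element, which makes it
# usable as a grouping column
ids <- rleid(x)
ids
```

Because rleid() returns a vector as long as its input, it can go straight into by, whereas rle() returns one entry per run.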
Some test data:

> dat
time bike_id address
1: 2017-11-22 15:45:34 1 Waters Rd
2: 2017-11-22 15:50:16 1 Waters Rd
3: 2017-11-22 16:00:03 1 Washington Ave
4: 2017-11-22 16:10:03 1 Washington Ave
5: 2017-11-22 16:20:02 1 Washington Ave
6: 2017-11-22 16:30:02 2 Shady Lane
7: 2017-11-22 16:40:03 2 Comstock Ave
8: 2017-11-22 16:50:02 2 Comstock Ave
9: 2017-11-22 17:00:02 2 Comstock Ave
10: 2017-11-22 17:10:02 2 Comstock Ave
11: 2017-11-22 17:20:03 3 Scranton Drive
12: 2017-11-22 17:30:03 3 Scranton Drive
13: 2017-11-22 17:40:03 3 Scranton Drive
14: 2017-11-22 17:50:03 3 Shady Lane
15: 2017-11-22 18:00:04 3 Scranton Drive
16: 2017-11-23 18:10:03 1 Shady Lane
17: 2017-11-23 18:20:03 1 Shady Lane
18: 2017-11-23 18:30:02 1 Shady Lane
19: 2017-11-23 18:40:03 1 Shady Lane
20: 2017-11-23 18:50:03 1 Shady Lane
21: 2017-11-23 19:00:03 2 Lovers Lane
22: 2017-11-23 19:10:02 2 Mulholland Drive
23: 2017-11-23 19:20:03 2 Mulholland Drive
24: 2017-11-23 19:30:02 2 Mulholland Drive
25: 2017-11-23 19:40:03 2 Mulholland Drive
I know that rle(dat$address) would produce the third column of the desired output below, but I'm not sure how to apply rle() by group in data.table.

Any suggestions would be helpful!
Desired output:
> output
date bike_id rle
1 2017-11-22 1 2
2 2017-11-22 1 3
3 2017-11-22 2 1
4 2017-11-22 2 4
5 2017-11-22 3 3
6 2017-11-22 3 1
7 2017-11-22 3 1
8 2017-11-23 1 5
9 2017-11-23 2 1
10 2017-11-23 2 4
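As a quick check of the rle() claim, the sketch below runs plain base R rle() on the address column alone (copied by hand from the sample data). It happens to match the third column of the desired output only because no run in this particular dataset crosses a bike_id or date boundary; in general the grouping is still needed.

```r
# Address column of the sample data, typed out by hand
address <- c("Waters Rd", "Waters Rd",
             "Washington Ave", "Washington Ave", "Washington Ave",
             "Shady Lane",
             "Comstock Ave", "Comstock Ave", "Comstock Ave", "Comstock Ave",
             "Scranton Drive", "Scranton Drive", "Scranton Drive",
             "Shady Lane", "Scranton Drive",
             "Shady Lane", "Shady Lane", "Shady Lane", "Shady Lane", "Shady Lane",
             "Lovers Lane",
             "Mulholland Drive", "Mulholland Drive", "Mulholland Drive",
             "Mulholland Drive")

# Ungrouped run lengths: 2 3 1 4 3 1 1 5 1 4
run_lengths <- rle(address)$lengths
run_lengths
```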
Here is the sample data:

> dput(dat)
structure(list(time = structure(c(1511383534.43394, 1511383816.49785,
1511384403.94561, 1511385003.17654, 1511385602.47887, 1511386202.99895,
1511386803.18361, 1511387402.98233, 1511388002.69461, 1511388602.5818,
1511389203.52712, 1511389803.652, 1511390403.26619, 1511391003.79218,
1511391604.30061, 1511478603.55103, 1511479203.60366, 1511479802.97132,
1511480403.45374, 1511481003.12783, 1511481603.34055, 1511482202.62777,
1511482803.66405, 1511483402.83378, 1511484003.46605), tzone = "", class = c("POSIXct",
"POSIXt")), bike_id = c(1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3,
3, 3, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2), address = c("Waters Rd",
"Waters Rd", "Washington Ave", "Washington Ave", "Washington Ave",
"Shady Lane", "Comstock Ave", "Comstock Ave", "Comstock Ave",
"Comstock Ave", "Scranton Drive", "Scranton Drive", "Scranton Drive",
"Shady Lane", "Scranton Drive", "Shady Lane", "Shady Lane", "Shady Lane",
"Shady Lane", "Shady Lane", "Lovers Lane", "Mulholland Drive",
"Mulholland Drive", "Mulholland Drive", "Mulholland Drive")), .Names = c("time",
"bike_id", "address"), class = c("data.table", "data.frame"), row.names = c(NA,
-25L), .internal.selfref = <pointer: 0x10300d178>)

Edit:

The only case where the code in the answer below gives the wrong result is this second dataset:
> dput(dat)
structure(list(bike_id = c(1, 1, 1, 1, 1, 1), lon = c(-76.968,
-76.968, -76.968, -72.141, -72.141, -72.141), lat = c(38.924,
38.924, 38.924, -39.219, -39.219, -39.219), time = structure(c(1511383534.49273,
1511383816.52327, 1511384403.97359, 1511385003.20305, 1511385602.50507,
1511299803.02598), tzone = "", class = c("POSIXct", "POSIXt"))), .Names = c("bike_id",
"lon", "lat", "time"), row.names = c(NA, -6L), class = c("data.table",
"data.frame"), .internal.selfref = <pointer: 0x10300d178>)
> dat
bike_id lon lat time
1: 1 -76.968 38.924 2017-11-22 15:45:34
2: 1 -76.968 38.924 2017-11-22 15:50:16
3: 1 -76.968 38.924 2017-11-22 16:00:03
4: 1 -72.141 -39.219 2017-11-22 16:10:03
5: 1 -72.141 -39.219 2017-11-22 16:20:02
6: 1 -72.141 -39.219 2017-11-21 16:30:03
> dat[, .(date = as.Date(time)[1], n = .N), .(bike_id, grp = rleid(lat, lon))][, grp := NULL][]
This produces the output below, in which the 2017-11-21 record (row 6) is counted under 2017-11-22:
bike_id date n
1: 1 2017-11-22 3
2: 1 2017-11-22 3
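Answer 0 (score: 6)

We can use rleid from data.table:

dat[, .(date = as.Date(time)[1], n = .N), .(bike_id, grp = rleid(address))][, grp := NULL][]

If there is more than one 'date' per group (as in the second dataset), the code above keeps only the first 'date' (that is what the [1] does). To get every date, use unique: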
dat[, .(date = unique(as.Date(time)), n = .N), by = .(bike_id, grp = rleid(lon, lat))]
# bike_id grp date n
#1: 1 1 2017-11-22 3
#2: 1 2 2017-11-22 3
#3: 1 2 2017-11-21 3

However, this returns multiple rows per group. If we need only one row per group, create a list column (which keeps the Date class):

dat[, .(date = list(unique(as.Date(time))), n = .N), by = .(bike_id, grp = rleid(lon, lat))]
# bike_id grp date n
#1: 1 1 2017-11-22 3
#2: 1 2 2017-11-22,2017-11-21 3
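or paste the unique elements into a single string.

Based on the OP's update to the post with the expected output (from the second dataset), we also need to use 'date' as a grouping variable.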
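The code for these last two variants is not shown above, so the following is only a sketch of what they might look like, not necessarily the answerer's exact calls. It rebuilds the second dataset by hand, with timestamps rounded to whole seconds and read as UTC (an assumption; the original dput has tzone = ""). Using toString() for the paste step and the names pasted and res are likewise choices made for illustration.

```r
library(data.table)

# Second dataset from the question, rebuilt by hand.
# Assumption: timestamps rounded to whole seconds and interpreted as UTC.
dat <- data.table(
  bike_id = rep(1, 6),
  lon = rep(c(-76.968, -72.141), each = 3),
  lat = rep(c(38.924, -39.219), each = 3),
  time = as.POSIXct(
    c(1511383534, 1511383816, 1511384403,
      1511385003, 1511385602, 1511299803),
    origin = "1970-01-01", tz = "UTC"
  )
)

# Variant 1: one row per run, with the unique dates pasted into one string
pasted <- dat[, .(date = toString(unique(as.Date(time))), n = .N),
              by = .(bike_id, grp = rleid(lon, lat))]
pasted

# Variant 2: add the date to the grouping, so a run that spans two dates
# is split into one row per date
res <- dat[, .(n = .N),
           by = .(bike_id, date = as.Date(time), grp = rleid(lon, lat))]
res
```

With date in by, the second location contributes two rows (two records on 2017-11-22 and one on 2017-11-21) instead of a single row of three.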