Cartesian join of two R data.tables and taking the minimum

Time: 2018-01-05 16:07:12

Tags: r join data.table

I have two data.table objects in R, streets and crashes, described below:

head(streets)
      link_id      Lat     Long
1:  706815684 44.13163  9.84736
2:  572513298 46.87760 15.77544
3:  974462021 41.86439 16.04506
4:  906821226 43.30472 11.59198
5:  537724528 46.30359  7.59026
6: 1062652524 44.83993 19.08552

head(crashes)
      ID_SX      Lat     Long
1: rca89123 45.35955  9.64950
2: rca89654 37.07544 15.28659
3: rca83674 44.42947  8.89526
4: lcg55792 38.08756 13.53466
5: lcg11992 41.81531 12.45126
6: iix21744 38.02655 12.88128

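I would like to append to the crashes dataset the link_id of the street in the streets table that lies at the minimum distance (computed with the R geosphere package):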

I tried this snippet, but it failed:

temp=crashes[streets(hdist=geosphere::distm(c(x.Long,x.Lat),c(i.Long,i.Lat),fun=distHaversine)),allow.cartesian=T]

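Note that the streets dataset is very large (about 9 million rows), while crashes is very small (about 400 rows). I believe that in R only data.table can handle this well, but I don't know how...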

Thanks in advance for your support.

1 Answer:

Answer 0: (score: 1)

To avoid a Cartesian join of 9 M rows x 400 rows, we can try to narrow down the list of candidates with a non-equi join.

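The idea is to restrict the search to an "area of vicinity": for each crash site, select only the streets whose Lat and Long lie within a given delta around that crash site. Then we only need to compute the distance to those nearby streets and pick the minimum.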

Here is my attempt using the provided data:

library(data.table)
# define +/- deltas for non-equi join ("area of vicinity")
d_lat <- 2.0
d_lon <- 2.0
streets[crashes[, .(ID_SX, Lat, Long,
                    # create lower and upper bounds
                    lb.lat = Lat - d_lat, ub.lat = Lat + d_lat, 
                    lb.lon = Long - d_lon, ub.lon = Long + d_lon)],
        # non-equi join conditions
        on = .(Lat > lb.lat, Lat < ub.lat, Long > lb.lon, Long < ub.lon), 
        .(link_id, x.Lat, x.Long, ID_SX, i.Lat, i.Long)][
          # compute distance for each row
          , hdist := geosphere::distm(c(x.Long, x.Lat), c(i.Long, i.Lat), fun = geosphere::distHaversine),
          by = .(link_id, ID_SX)][
            # find minimum for each crash site
            , .SD[which.min(hdist)], by = ID_SX]
      ID_SX   link_id    x.Lat   x.Long    i.Lat   i.Long     hdist
1: rca89123 706815684 44.13163  9.84736 45.35955  9.64950 137583.53
2: rca83674 706815684 44.13163  9.84736 44.42947  8.89526  82806.14
3: lcg11992 906821226 43.30472 11.59198 41.81531 12.45126 180146.65
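Grouping by (link_id, ID_SX) calls distm once per candidate pair, which may become slow when many streets fall inside the area of vicinity. As a rough sketch of an alternative, the haversine distance can be computed for all pairs in one vectorised call. The helper haversine_m below is hypothetical (it is not part of geosphere) and assumes the radius 6378137 m, which is the default of geosphere::distHaversine. Note that geosphere::distHaversine itself also accepts two-column matrices of points and is vectorised over their rows.

```r
# Hypothetical helper, not part of geosphere: vectorised haversine distance in metres.
# Assumption: sphere radius r = 6378137 m, the default used by geosphere::distHaversine.
haversine_m <- function(lon1, lat1, lon2, lat2, r = 6378137) {
  to_rad <- pi / 180
  dlat <- (lat2 - lat1) * to_rad
  dlon <- (lon2 - lon1) * to_rad
  a <- sin(dlat / 2)^2 +
    cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlon / 2)^2
  # pmin() guards against rounding pushing sqrt(a) slightly above 1
  2 * r * asin(pmin(1, sqrt(a)))
}

# One call for several pairs: the three street/crash pairs from the result above
d <- haversine_m(
  lon1 = c(9.84736, 9.84736, 11.59198),
  lat1 = c(44.13163, 44.13163, 43.30472),
  lon2 = c(9.64950, 8.89526, 12.45126),
  lat2 = c(45.35955, 44.42947, 41.81531)
)
print(round(d, 2))
```

In the chain above, this would amount to replacing the grouped distm step with something like hdist := haversine_m(x.Long, x.Lat, i.Long, i.Lat) and dropping the by argument of that step.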

Note that not every crash site has found a street within the "area of vicinity". This is because the sample data contain only a few streets.
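To see which crash sites were left without a street, one option is an anti-join of crashes against the matched IDs. The sketch below uses a small stand-in table named matched (a hypothetical name) that holds the three IDs from the result shown above; in practice it would be the result of the join chain.

```r
library(data.table)

# Crash IDs from the sample data
crashes <- data.table(
  ID_SX = c("rca89123", "rca89654", "rca83674",
            "lcg55792", "lcg11992", "iix21744")
)

# Stand-in for the join result (hypothetical name): IDs that received a street
matched <- data.table(ID_SX = c("rca89123", "rca83674", "lcg11992"))

# Anti-join: keep the crashes that have no row in `matched`
unmatched <- crashes[!matched, on = "ID_SX"]
print(unmatched)
```

Any IDs listed in unmatched indicate that the deltas are too small for those crash sites.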

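For production use, d_lat and d_lon need to be tuned: as small as possible to reduce run time and memory consumption, but large enough to find a street for every crash site.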
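One possible way to pick the deltas is to start from a search radius in metres and convert it to degrees. The sketch below is only an approximation that is reasonable for small radii: a degree of latitude is taken as a fixed length, while a degree of longitude shrinks with the cosine of the latitude. The helper radius_to_deltas, the 50 km radius, and the sphere radius 6378137 m are assumptions for illustration.

```r
# Hypothetical helper: convert a search radius in metres to +/- deltas in degrees.
# Assumption: spherical Earth with r = 6378137 m; approximation for small radii.
radius_to_deltas <- function(radius_m, lat_deg, r = 6378137) {
  m_per_deg <- 2 * pi * r / 360  # metres per degree along a great circle
  list(
    d_lat = radius_m / m_per_deg,
    # a degree of longitude is shorter away from the equator
    d_lon = radius_m / (m_per_deg * cos(lat_deg * pi / 180))
  )
}

# Illustrative 50 km radius, evaluated at the northernmost sample crash site
# (Lat 45.35955), which yields the widest longitude delta for these data
deltas <- radius_to_deltas(50000, lat_deg = 45.35955)
print(deltas)
```

The resulting d_lat and d_lon could then be plugged into the lower and upper bounds of the non-equi join. Because the box is only a prefilter, it is safer to round the deltas up than down.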

Data

library(data.table)
streets <- fread(
  "i link_id      Lat     Long
1:  706815684 44.13163  9.84736
2:  572513298 46.87760 15.77544
3:  974462021 41.86439 16.04506
4:  906821226 43.30472 11.59198
5:  537724528 46.30359  7.59026
6: 1062652524 44.83993 19.08552", drop = 1L)
crashes <- fread(
  "i ID_SX      Lat     Long
  1: rca89123 45.35955  9.64950
  2: rca89654 37.07544 15.28659
  3: rca83674 44.42947  8.89526
  4: lcg55792 38.08756 13.53466
  5: lcg11992 41.81531 12.45126
  6: iix21744 38.02655 12.88128", drop = 1L)