Question

I currently have a large table, mivehdetailedtrajectory (25B rows), and a small table, cell_data_tower (400 rows), that I need to join using PostGIS. Specifically, I need to run this query:

SELECT COUNT(traj.*), tower.id
FROM cell_data_tower tower LEFT OUTER JOIN mivehdetailedtrajectory traj
ON ST_Contains(tower.geom, traj.location)
GROUP BY tower.id
ORDER BY tower.id;

It fails with an error about not being able to write to disk. That seemed odd for a SELECT, so I ran EXPLAIN:

NOTICE:  gserialized_gist_joinsel: jointype 1 not supported

                                                     QUERY PLAN                                                     
--------------------------------------------------------------------------------------------------------------------
 Sort  (cost=28905094882.25..28905094883.25 rows=400 width=120)
   Sort Key: tower.id
   ->  HashAggregate  (cost=28905094860.96..28905094864.96 rows=400 width=120)
         ->  Nested Loop Left Join  (cost=0.00..28904927894.80 rows=33393232 width=120)
               Join Filter: ((tower.geom && traj.location) AND _st_contains(tower.geom, traj.location))
               ->  Seq Scan on cell_data_tower tower  (cost=0.00..52.00 rows=400 width=153)
               ->  Materialize  (cost=0.00..15839886.96 rows=250449264 width=164)
                     ->  Seq Scan on mivehdetailedtrajectory traj  (cost=0.00..8717735.64 rows=250449264 width=164)

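I don't understand why Postgres thinks it should materialize the inner table. More generally, I don't understand this plan. It seems like it should keep the cell_data_tower table in memory and iterate over the mivehdetailedtrajectory table. Any thoughts on how I can optimize this so that it (a) runs at all and (b) finishes in a reasonable amount of time? Specifically, it seems like this should be doable in less than a day.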

Edit: Postgres version 9.3

Answer 1

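Queries that need a lot of memory are one of the rare places where a correlated subquery performs better (a LATERAL JOIN should also work, but those are beyond me). Also note that you weren't selecting tower.id, so your results wouldn't be very useful.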

SELECT tower.id, (SELECT COUNT(traj.*) 
                  FROM mivehdetailedtrajectory traj
                  WHERE ST_Contains(tower.geom, traj.location))
FROM cell_data_tower tower
ORDER BY tower.id;
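
For comparison, here is a rough sketch of what the LATERAL JOIN form mentioned above might look like. It is untested against this data and assumes only the table and column names from the question; the aliases counts and traj_count are made up for illustration. LATERAL is available from Postgres 9.3 onward.

```sql
-- Sketch only: one aggregate per tower, computed in a lateral subquery.
-- "counts" and "traj_count" are illustrative names, not from the question.
SELECT tower.id, counts.traj_count
FROM cell_data_tower tower
LEFT JOIN LATERAL (
    SELECT COUNT(*) AS traj_count
    FROM mivehdetailedtrajectory traj
    WHERE ST_Contains(tower.geom, traj.location)
) counts ON true
ORDER BY tower.id;
```

Because the lateral subquery is an aggregate without GROUP BY, it always returns exactly one row, so towers with no matching points should still appear with a count of 0.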

Try running it with LIMIT 1 first. The total run time should be roughly the run time for one tower multiplied by the number of towers.
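
One possible way to time a single tower is sketched below. It assumes the table and column names from the question; traj_count is an illustrative alias. The LIMIT is placed in a subquery on purpose: on older planners such as 9.3, a select-list subquery combined with ORDER BY and LIMIT on the outer query may still be evaluated for every tower before the limit applies, so this form is a guess at the safer way to make sure only one tower is processed.

```sql
-- Sketch only: time the count for a single tower.
-- EXPLAIN ANALYZE actually executes the query and reports the elapsed time.
EXPLAIN ANALYZE
SELECT tower.id,
       (SELECT COUNT(*)
        FROM mivehdetailedtrajectory traj
        WHERE ST_Contains(tower.geom, traj.location)) AS traj_count
FROM (SELECT id, geom
      FROM cell_data_tower
      ORDER BY id
      LIMIT 1) tower;
```

With 400 towers, the full query should then take on the order of 400 times the reported run time, assuming the towers are roughly comparable in cost.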

Answer 2

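I don't have a database as big as yours, only 80M rows. But in my case, I created a LinkID field to record where each geom is located, and I calculate the closest LinkID when a new record is inserted.

When I found out that finding a single LinkID takes 30 ms, and that doing so 80M times would take about 27 days, I switched to precomputing those values.

Also, I don't keep all the records; I only keep about one month of data at any time.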
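
Applied to the tables in the question, the precompute-on-insert idea might look something like the sketch below. Everything beyond the original table and column names is an assumption made for illustration: the tower_id column, its integer type, the function set_tower_id, and the trigger trg_set_tower_id are all hypothetical. It also assumes each point should be attributed to at most one tower, which differs from the original join if tower geometries overlap.

```sql
-- Hypothetical column holding the precomputed tower for each point
-- (integer is an assumed type for cell_data_tower.id).
ALTER TABLE mivehdetailedtrajectory ADD COLUMN tower_id integer;

-- Hypothetical trigger function: look up the containing tower at insert time.
CREATE OR REPLACE FUNCTION set_tower_id() RETURNS trigger AS $$
BEGIN
    -- Pick one containing tower; tower_id becomes NULL if no tower contains the point.
    SELECT tower.id INTO NEW.tower_id
    FROM cell_data_tower tower
    WHERE ST_Contains(tower.geom, NEW.location)
    ORDER BY tower.id
    LIMIT 1;
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER trg_set_tower_id
BEFORE INSERT ON mivehdetailedtrajectory
FOR EACH ROW EXECUTE PROCEDURE set_tower_id();

-- With tower_id filled in, the count no longer needs a spatial join.
SELECT tower.id, COUNT(traj.tower_id)
FROM cell_data_tower tower
LEFT JOIN mivehdetailedtrajectory traj ON traj.tower_id = tower.id
GROUP BY tower.id
ORDER BY tower.id;
```

The trigger only covers newly inserted rows; rows already in the table would still need a one-off backfill, which pays the same per-row lookup cost once.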

Optimizing a large PostGIS query