Compound GIN / GiST index on uuid, timestamp and geometry

Asked: 2014-12-24 02:25:18

Tags: postgresql indexing postgis k-means

I am trying to optimize a query against the following table:

CREATE TABLE t (
    uuid4 UUID PRIMARY KEY
    , arr TEXT[]
    , geom GEOMETRY
    , ts TIMESTAMP WITHOUT TIME ZONE
);
CREATE INDEX ON t USING GIST (geom);

The query looks like this:

explain analyze 
SELECT kmeans
, count(*)::int
, ST_X(ST_Centroid(ST_Collect(geom))) AS lon
, ST_Y(ST_Centroid(ST_Collect(geom))) AS lat
, STRING_TO_ARRAY(STRING_AGG(ARRAY_TO_STRING(arr, ','), ','), ',') AS arr 
FROM (
    SELECT kmeans(ARRAY[ST_X(geom), ST_Y(geom)], 25) OVER (), geom, arr 
    FROM t 
    WHERE ts > NOW() - '12 hours'::interval 
    AND geom IS NOT NULL 
    AND uuid4 != '9ab0f8cd-9707-41da-8e30-6d29a0f22242'::uuid 
    AND arr @> (SELECT arr FROM t WHERE uuid4 = '9ab0f8cd-9707-41da-8e30-6d29a0f22242'::uuid LIMIT 1) 
    AND ST_Distance_Sphere(ST_MakePoint(-77, 38), geom) < 10000 
) AS ksub 
GROUP BY kmeans 
ORDER BY kmeans;

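Basically, this finds all rows within a certain distance that have geom populated, fall within the time range, and whose arr contains every item of the specified arr. The rows found are then clustered with the kmeans function from kmeans-postgresql. This is the plan I am currently seeing: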

GroupAggregate  (cost=347.69..349.59 rows=38 width=98) (actual time=50.034..50.384 rows=25 loops=1)
   ->  Sort  (cost=347.69..347.78 rows=38 width=98) (actual time=49.994..49.999 rows=99 loops=1)
         Sort Key: (kmeans(ARRAY[st_x(t.geom), st_y(t.geom)], 25) OVER (?))
         Sort Method: quicksort  Memory: 42kB
         ->  WindowAgg  (cost=25.18..346.31 rows=38 width=94) (actual time=49.955..49.968 rows=99 loops=1)
               InitPlan 1 (returns $0)
                 ->  Limit  (cost=0.29..8.30 rows=1 width=62) (actual time=0.018..0.018 rows=1 loops=1)
                       ->  Index Scan using t_uuid4_ts_idx on t t_1  (cost=0.29..8.30 rows=1 width=62) (actual time=0.017..0.017 rows=1 loops=1)
                             Index Cond: (uuid4 = '9ab0f8cd-9707-41da-8e30-6d29a0f22242'::uuid)
               ->  Bitmap Heap Scan on t  (cost=16.88..337.34 rows=38 width=94) (actual time=13.363..49.747 rows=99 loops=1)
                     Recheck Cond: (arr @> $0)
                     Filter: ((geom IS NOT NULL) AND (uuid4 <> '9ab0f8cd-9707-41da-8e30-6d29a0f22242'::uuid) AND (ts > (now() - '12:00:00'::interval)) AND (_st_distance('0101000020E610000000000000004053C00000000000004340'::geography, geography(geom), 0::double precision, false) < 10000::double precision))
                     Rows Removed by Filter: 22989
                     ->  Bitmap Index Scan on t_arr_idx  (cost=0.00..16.87 rows=115 width=0) (actual time=13.072..13.072 rows=23089 loops=1)
                           Index Cond: (arr @> $0)
Total runtime: 50.464 ms

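The Bitmap Heap Scan plus Bitmap Index Scan seems to be the best indexing solution available, but I keep wondering whether there is a way to avoid the extra filter and recheck. Any ideas for alternative indexes I could build to improve performance? I have already tried: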

Indexes:
    "t_pkey" PRIMARY KEY, btree (uuid4)
    "t_geom_idx" gist (geom)
    "t_geom_ts_idx" gist (geom, ts)
    "t_geom_ts_uuid4_idx" gist (geom, ts, (uuid4::text))
    "t_iam_idx" gin (arr)
    "t_ts_geom_idx" gist (ts, geom)
    "t_ts_geom_uuid4_idx" gist (ts, geom, (uuid4::text))
    "t_ts_uuid4_geom_idx" gist (ts, (uuid4::text), geom)
    "t_uuid4_ts_idx" btree (uuid4, ts)

Note that kmeans comes from the extension at https://github.com/umitanuki/kmeans-postgresql.

1 answer:

Answer 0 (score: 1)

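Following JohnBarça's suggestion, I used ST_DWithin with a GiST index on my geometry and timestamp, which cut the runtime of the same query posted above to under 10 ms. The only tricky part was realizing that I needed degrees rather than meters for the geometry calculation (geography can use meters). This question pointed me to a sufficiently accurate solution: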

AND ST_DWithin(ST_MakePoint(-77.0710820577842, 37.9940763922052), geom, 10000 / (111.31 * 1000 * COS(ST_Y(ST_MakePoint(-77.0710820577842, 37.9940763922052)) * Pi() / 180)))
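
For readers who want to check the conversion, or who would rather stay in meters, here is a minimal sketch. It assumes geom stores lon/lat points in SRID 4326 (not stated in the question) and uses a hypothetical index name, t_geog_idx. The first statement only evaluates the radius formula above. The rest shows the geography form of ST_DWithin, which takes its distance in meters.

```sql
-- Sanity check of the meters-to-degrees conversion (plain PostgreSQL, no PostGIS needed).
-- 111.31 km per degree is the constant used in the answer; radians() replaces "* Pi() / 180".
SELECT 10000 / (111.31 * 1000 * cos(radians(37.9940763922052))) AS radius_deg;

-- Assumption: geom is lon/lat in SRID 4326, so it can be cast to geography.
-- An expression index lets the geography form of ST_DWithin use GiST.
CREATE INDEX t_geog_idx ON t USING GIST ((geom::geography));  -- hypothetical index name

-- The same 10 km predicate expressed in meters.
SELECT count(*)
FROM t
WHERE ts > NOW() - '12 hours'::interval
  AND ST_DWithin(
        geom::geography,
        ST_SetSRID(ST_MakePoint(-77.0710820577842, 37.9940763922052), 4326)::geography,
        10000);
```

At this latitude the first statement comes out to about 0.114 degrees. Dividing by the cosine of the latitude gives the length of 10 km in degrees of longitude. In the north-south direction, 10 km is only about 0.09 degrees, so the degree-based radius matches slightly more rows along that axis. That is presumably why the answer calls it "sufficiently accurate" rather than exact.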