使用索引优化Postgres查询以获取大量数据

时间:2015-12-27 19:05:00

标签: ruby-on-rails postgresql optimization postgresql-performance

我有一个数据库posts,其中有大约2000万行。我正在尝试使用以下查询缩小分页列表的帖子:

SELECT "posts".* FROM "posts"
WHERE "posts"."source_id" IN (14790, 14787, 32928, 14796, 14791, 15503, 14789, 14772, 15506, 14794, 15543, 31615, 15507, 15508, 14800)
AND "posts"."deleted_at" IS NULL
ORDER BY external_created_at desc LIMIT 100 OFFSET 0;

(大约有330万行与查询中的source_id匹配)

当我这样做时,需要大约60秒,我得到以下EXPLAIN ANALYZEsee on depesz):

EXPLAIN ANALYZE SELECT  "posts".* FROM "posts"  WHERE "posts"."source_id" IN (14790, 14787, 32928, 14796, 14791, 15503, 14789, 14772, 15506, 14794, 15543, 31615, 15507, 15508, 14800) AND "posts"."deleted_at" IS NULL O
RDER BY external_created_at desc LIMIT 100 OFFSET 0;
                                                                                     QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=2530223.38..2530223.63 rows=100 width=1040) (actual time=66564.583..66564.616 rows=100 loops=1)
   ->  Sort  (cost=2530223.38..2534981.19 rows=1903125 width=1040) (actual time=66564.571..66564.594 rows=100 loops=1)
         Sort Key: external_created_at
         Sort Method: top-N heapsort  Memory: 89kB
         ->  Bitmap Heap Scan on posts  (cost=35499.76..2457487.31 rows=1903125 width=1040) (actual time=279.640..64496.330 rows=1674072 loops=1)
               Recheck Cond: ((source_id = ANY ('{14790,14787,32928,14796,14791,15503,14789,14772,15506,14794,15543,31615,15507,15508,14800}'::integer[])) AND (deleted_at IS NULL))
               Rows Removed by Index Recheck: 4640188
               ->  Bitmap Index Scan on index_on_posts_partial_source_id_with_order  (cost=0.00..35023.98 rows=1903125 width=0) (actual time=275.922..275.922 rows=1674072 loops=1)
                     Index Cond: (source_id = ANY ('{14790,14787,32928,14796,14791,15503,14789,14772,15506,14794,15543,31615,15507,15508,14800}'::integer[]))
 Total runtime: 66564.962 ms
(10 rows)

这是它正在使用的索引:

CREATE INDEX index_on_posts_partial_source_id_with_order ON posts USING btree (source_id) WHERE (deleted_at IS NULL);

似乎Recheck Cond是这个查询最慢的事情。我所看到的关于Recheck条件的所有内容都涉及增加postgres使用的内存,因为数据是" lossy"但我在查询计划中没有看到类似内容。

有关如何加快速度的任何建议吗?

似乎以某种方式摆脱了Recheck,或以某种方式排序external_created_at将是我最好的选择。

编辑:我使用的是postgres版本9.3.4。这是帖子表:

CREATE TABLE posts (
    id integer NOT NULL,
    source_id integer,
    message text,
    image text,
    external_id text,
    created_at timestamp without time zone,
    updated_at timestamp without time zone,
    external text,
    like_count integer DEFAULT 0 NOT NULL,
    comment_count integer DEFAULT 0 NOT NULL,
    external_created_at timestamp without time zone,
    deleted_at timestamp without time zone,
    poster_name character varying(255),
    poster_image text,
    poster_url character varying(255),
    poster_id text,
    "position" integer,
    location character varying(255),
    description text,
    video text,
    rejected_at timestamp without time zone,
    deleted_by character varying(255),
    height integer,
    width integer
);

1 个答案:

答案 0 :(得分:1)

您的查询为分页列表返回了几百万行。仔细思考为这么多页面返回数据的智慧。另外,要认真考虑是否需要所有列。我怀疑你这样做。

我构建了一个粗糙的表格,并在其中插入了大约1000万个随机(ish)行。我使用PostgreSQL 9.4的查询计划与你的查询计划大致相似。

"Limit  (cost=138609.10..138609.35 rows=100 width=24) (actual time=1410.012..1410.038 rows=100 loops=1)"
"  ->  Sort  (cost=138609.10..140344.25 rows=694059 width=24) (actual time=1410.010..1410.026 rows=100 loops=1)"
"        Sort Key: external_created_at"
"        Sort Method: top-N heapsort  Memory: 29kB"
"        ->  Bitmap Heap Scan on posts  (cost=12217.47..112082.66 rows=694059 width=24) (actual time=374.393..919.687 rows=3000000 loops=1)"
"              Recheck Cond: ((source_id = ANY ('{14790,14787,32928,14796,14791,15503,14789,14772,15506,14794,15543,31615,15507,15508,14800}'::integer[])) AND (deleted_at IS NULL))"
"              Heap Blocks: exact=16217"
"              ->  Bitmap Index Scan on index_on_posts_partial_source_id_with_order  (cost=0.00..12043.95 rows=694059 width=0) (actual time=370.593..370.593 rows=3000000 loops=1)"
"                    Index Cond: (source_id = ANY ('{14790,14787,32928,14796,14791,15503,14789,14772,15506,14794,15543,31615,15507,15508,14800}'::integer[]))"
"Planning time: 0.264 ms"
"Execution time: 1410.097 ms"

向external_created_at添加索引会将执行时间减少约470倍。但我没有相同的值分布。

create index on test.posts (external_created_at);
analyze test.posts;
explain analyze
select * from test.posts
where source_id in (14790, 14787, 32928, 14796, 14791, 15503, 14789, 14772, 15506, 14794, 15543, 31615, 15507, 15508, 14800)
and deleted_at is null
order by external_created_at desc limit 100 offset 0;
"Limit  (cost=0.43..131.43 rows=100 width=24) (actual time=0.219..2.992 rows=100 loops=1)"
"  ->  Index Scan Backward using posts_external_created_at_idx on posts  (cost=0.43..900991.48 rows=687808 width=24) (actual time=0.216..2.976 rows=100 loops=1)"
"        Filter: ((deleted_at IS NULL) AND (source_id = ANY ('{14790,14787,32928,14796,14791,15503,14789,14772,15506,14794,15543,31615,15507,15508,14800}'::integer[])))"
"        Rows Removed by Filter: 350"
"Planning time: 0.302 ms"
"Execution time: 3.024 ms"