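PostgreSQL not using index

Time: 2016-05-10 12:20:07

Tags: json postgresql postgres-9.4

I have a big table, crumbs (about 100M+ rows, 100GB). It is just a collection of JSON documents stored as text. It has an index on the column run_id, which has about 10K unique values, so each run is small (1K - 1M rows).

A simple query: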

explain analyze verbose select * from crumbs c 
where c.run_id='2016-04-26T19_02_01_015Z' limit 10

The plan is fine:

Limit  (cost=0.56..36.89 rows=10 width=2262) (actual time=1.978..2.016 rows=10 loops=1)
  Output: id, robot_id, run_id, content, created_at, updated_at, table_id, fork_id, log, err
  ->  Index Scan using index_crumbs_on_run_id on public.crumbs c  (cost=0.56..5533685.73 rows=1523397 width=2262) (actual time=1.975..1.996 rows=10 loops=1)
        Output: id, robot_id, run_id, content, created_at, updated_at, table_id, fork_id, log, err
        Index Cond: ((c.run_id)::text = '2016-04-26T19_02_01_015Z'::text)
Planning time: 0.117 ms
Execution time: 2.048 ms

But if I try to look inside the JSON stored in one of the columns, the planner wants to do a full scan:

explain verbose select x from crumbs c, 
lateral json_array_elements(c.content::json) x
where c.run_id='2016-04-26T19_02_01_015Z' 
limit 10

Plan:

Limit  (cost=0.01..0.69 rows=10 width=32)
  Output: x.value
  ->  Nested Loop  (cost=0.01..10332878.67 rows=152343800 width=32)
        Output: x.value
        ->  Seq Scan on public.crumbs c  (cost=0.00..7286002.66 rows=1523438 width=895)
              Output: c.id, c.robot_id, c.run_id, c.content, c.created_at, c.updated_at, c.table_id, c.fork_id, c.log, c.err
              Filter: ((c.run_id)::text = '2016-04-26T19_02_01_015Z'::text)
        ->  Function Scan on pg_catalog.json_array_elements x  (cost=0.01..1.01 rows=100 width=32)
              Output: x.value
              Function Call: json_array_elements((c.content)::json)

I tried:

analyze crumbs

But it made no difference.

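Update 1: Disabling sequential scans for the whole database works, but that is not an option in our application, because in many other places sequential scans should stay: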

set enable_seqscan=false;

Plan:

Limit  (cost=0.57..1.14 rows=10 width=32) (actual time=0.120..0.294 rows=10 loops=1)
  Output: x.value
  ->  Nested Loop  (cost=0.57..8580698.45 rows=152343400 width=32) (actual time=0.118..0.273 rows=10 loops=1)
        Output: x.value
        ->  Index Scan using index_crumbs_on_run_id on public.crumbs c  (cost=0.56..5533830.45 rows=1523434 width=895) (actual time=0.087..0.107 rows=10 loops=1)
              Output: c.id, c.robot_id, c.run_id, c.content, c.created_at, c.updated_at, c.table_id, c.fork_id, c.log, c.err
              Index Cond: ((c.run_id)::text = '2016-04-26T19_02_01_015Z'::text)
        ->  Function Scan on pg_catalog.json_array_elements x  (cost=0.01..1.01 rows=100 width=32) (actual time=0.011..0.011 rows=1 loops=10)
              Output: x.value
              Function Call: json_array_elements((c.content)::json)
Planning time: 0.124 ms
Execution time: 0.337 ms
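
If the concern is only that the setting must not affect other queries, one option worth testing is to scope it to a single transaction with SET LOCAL. This is a sketch that was not tried in the original post; the query is the lateral one from above, and the setting reverts automatically at COMMIT or ROLLBACK.

```sql
BEGIN;

-- SET LOCAL lasts only until the end of the current transaction,
-- so other sessions and later transactions keep the default.
SET LOCAL enable_seqscan = off;

SELECT x
FROM crumbs c,
     LATERAL json_array_elements(c.content::json) x
WHERE c.run_id = '2016-04-26T19_02_01_015Z'
LIMIT 10;

COMMIT;
```

This works around the plan choice rather than explaining it, so it is best treated as a stopgap.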

Update 2

The schema is:

CREATE TABLE crumbs
(
  id serial NOT NULL,
  run_id character varying(255),
  content text,
  created_at timestamp without time zone,
  updated_at timestamp without time zone,
  CONSTRAINT crumbs_pkey PRIMARY KEY (id)
);

CREATE INDEX index_crumbs_on_run_id
  ON crumbs
  USING btree
  (run_id COLLATE pg_catalog."default");

Update 3

Rewriting the query like this:

select json_array_elements(c.content::json) x
from crumbs c
where c.run_id='2016-04-26T19_02_01_015Z' 
limit 10

gets the correct plan. It is still unclear why the wrong plan is chosen for the second query.

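3 answers:

Answer 0 (score: 0)

You have three different problems. First, the limit 10 in the first query is tipping the planner toward the index scan; without it, fetching every row that matches that run_id through the index would be quite expensive. For comparison, you might want to see what the first (unjoined) query plan looks like when you remove the limit. My guess is that the planner switches to a table scan.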
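
The comparison suggested above could be run as follows. Plain EXPLAIN only plans the query and does not execute it, so it is cheap to try even on a table of this size; the table, column, and run_id value are taken from the question.

```sql
-- First (unjoined) query with the LIMIT removed. Compare the scan node
-- chosen here with the Index Scan in the question's first plan.
EXPLAIN
SELECT *
FROM crumbs c
WHERE c.run_id = '2016-04-26T19_02_01_015Z';
```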

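Second, the lateral join is unnecessary and is throwing off the planner. You can expand the elements of the content array in the select clause, like so: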

select json_array_elements(content::json)
from crumbs
where run_id = '2016-04-26T19_02_01_015Z'
;

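This is more likely to use the index scan to pick out the rows for the run_id and then "unnest" the array elements for you.

But the third, hidden problem is what you are actually trying to get. If you run this last query as is, you are in the same boat as the first (unjoined) query without a limit, which means you probably won't get an index scan (not if you are reading such a large portion of the table).

Do you just want the first few arbitrary array elements out of all the content arrays for the run? If so, adding a limit clause here should be the end of the story. If you want all the array elements for this particular run, you may simply have to accept a table scan, although without the lateral join you are probably better off than with the original query.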
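
The effect of the limit on the estimates can be checked against the two lateral plans in the question. As a rough sketch, assuming a Limit node is priced in proportion to the fraction of the child's rows it expects to fetch (startup cost plus that fraction of the remaining cost), the figures copied from the Nested Loop lines of those plans give:

```sql
-- est_limit_cost = startup + (total - startup) * limit_rows / estimated_rows
-- All numbers are copied from the Nested Loop nodes in the question.
SELECT plan_name,
       round(startup + (total - startup) * 10 / est_rows, 2) AS est_limit_cost
FROM (VALUES
        ('seq scan + lateral',   0.01::numeric, 10332878.67::numeric, 152343800::numeric),
        ('index scan + lateral', 0.57,          8580698.45,           152343400)
     ) AS p(plan_name, startup, total, est_rows);
```

This comes out at roughly 0.69 for the sequential-scan plan and 1.13 for the index-scan plan, close to the 0.69 and 1.14 that the plans report. The sequential scan has almost no startup cost, and the estimated 152 million output rows (about 1.5 million matching rows times the 100 rows estimated per function call) make 10 rows look like a negligible fraction of the scan. On paper it is therefore the cheaper plan, even though the matching rows may in practice sit far into the table.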

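Answer 1 (score: 0)

Rewriting the query so that the limit is applied first, and the cross join against the function only afterwards, should make Postgres use the index.

Using a derived table: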

select x 
from (
    select *
    from crumbs 
    where run_id='2016-04-26T19_02_01_015Z' 
    limit 10
) c 
  cross join lateral json_array_elements(c.content::json) x

Or using a CTE:

with c as (
  select *
  from crumbs 
  where run_id='2016-04-26T19_02_01_015Z' 
  limit 10
)
select x
from c 
  cross join lateral json_array_elements(c.content::json) x

Or use json_array_elements() directly in the select list:

select json_array_elements(c.content::json) 
from crumbs c
where c.run_id='2016-04-26T19_02_01_015Z' 
limit 10

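However, this is not the same as the other two queries, because it applies the limit after "unnesting" the JSON array rather than to the number of rows returned from the crumbs table (which is what your first query is doing).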
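
The difference can be seen on a tiny stand-in data set. The sketch below uses a hypothetical two-row demo relation instead of crumbs, so it runs without touching the real table; the names demo, limit_then_unnest, and unnest_then_limit are made up for illustration.

```sql
WITH demo(id, content) AS (
    VALUES (1, '[1,2,3]'), (2, '[4,5]')
),
-- Limit the source rows first, then expand: every element of one row.
limit_then_unnest AS (
    SELECT x.value
    FROM (SELECT * FROM demo ORDER BY id LIMIT 1) d
    CROSS JOIN LATERAL json_array_elements(d.content::json) AS x
),
-- Expand in the select list, then limit: a single array element.
unnest_then_limit AS (
    SELECT json_array_elements(d.content::json) AS value
    FROM demo d
    LIMIT 1
)
SELECT (SELECT count(*) FROM limit_then_unnest) AS limit_then_unnest_rows,
       (SELECT count(*) FROM unnest_then_limit) AS unnest_then_limit_rows;
```

The first count is 3 (all elements of the one selected row) and the second is 1, so the two forms are interchangeable only if a limit on array elements is what you actually want.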

Answer 2 (score: 0)

Data modelling suggestion:

        -- Suggest replacing the column run_id (low cardinality, and rather fat)
        -- by a reference to a domain table, like:
        -- ------------------------------------------------------------------
CREATE TABLE runs
        ( run_seq serial NOT NULL PRIMARY KEY
        , run_id character varying UNIQUE
        );

        -- Grab all the distinct values occurring in crumbs.run_id
        -- -------------------------------------------------------
INSERT INTO runs (run_id)
SELECT DISTINCT run_id FROM crumbs;

        -- Add an FK column
        -- -----------------
ALTER TABLE crumbs
        ADD COLUMN run_seq integer REFERENCES runs(run_seq)
        ;

UPDATE crumbs c
SET run_seq = r.run_seq
FROM runs r
WHERE r.run_id = c.run_id
        ;
VACUUM ANALYZE runs;

        -- Drop old column and set new column to not nullable
        -- ---------------------------------------------------
ALTER TABLE crumbs
        DROP COLUMN run_id
        ;
ALTER TABLE crumbs
        ALTER COLUMN run_seq SET NOT NULL
        ;

        -- Recreate the supporting index for the FK
        -- adding id to support index-only lookups
        -- (and enforce uniqueness)
        -- -------------------------------------
CREATE UNIQUE INDEX index_crumbs_run_seq_id ON crumbs (run_seq,id)
        ;

        -- Refresh statistics
        -- ------------------
VACUUM ANALYZE crumbs; -- this may take some time ...

-- and then: join the runs table to your original crumbs table
-- -----------------------------------------------------------
-- explain analyze 
SELECT x FROM crumbs c
JOIN runs r ON r.run_seq = c.run_seq
        , lateral json_array_elements(c.content::json) x
WHERE r.run_id='2016-04-26T19_02_01_015Z'
LIMIT 10
        ;

Or: use the other answerers' suggestions with a similar join.

But possibly better: replace the ugly run_id text string with an actual timestamp.
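
As a sketch of that last idea, and assuming every run_id follows the same pattern as the single sample value in the question, the string could be parsed with to_timestamp(). The format template and the column name run_started_at are assumptions made for illustration, not something given in the question.

```sql
-- Double-quoted text in the template ("T", "Z") is matched as literal
-- input; the underscores act as separators between the fields.
-- The cast to timestamp keeps the wall-clock fields as written, in line
-- with the timestamp without time zone columns the table already uses.
SELECT to_timestamp('2016-04-26T19_02_01_015Z',
                    'YYYY-MM-DD"T"HH24_MI_SS_MS"Z"')::timestamp AS run_started_at;
```

Note that the trailing Z is skipped rather than interpreted as a time zone, so it would be worth checking a few converted values by hand before migrating the column.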