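What is wrong with the way I count rows in a complex query?

Time: 2014-05-24 22:11:17

Tags: sql database postgresql count postgresql-performance

I have a database with a few tables, each with a few million rows (the tables do have indexes). I need to count the rows in one table, but only those whose foreign key field points to a subset of another table. Here is the query: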

WITH filtered_title 
     AS (SELECT top.id 
         FROM   title top 
         WHERE  ( top.production_year >= 1982 
                  AND top.production_year <= 1984 
                  AND top.kind_id IN( 1, 2 ) 
                   OR EXISTS(SELECT 1 
                             FROM   title sub 
                             WHERE  sub.episode_of_id = top.id 
                                    AND sub.production_year >= 1982 
                                    AND sub.production_year <= 1984 
                                    AND sub.kind_id IN( 1, 2 )) )) 
SELECT Count(*) 
FROM   cast_info 
WHERE  EXISTS(SELECT 1 
              FROM   filtered_title 
              WHERE  cast_info.movie_id = filtered_title.id) 
       AND cast_info.role_id IN( 3, 8 ) 

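I use a CTE because there are more COUNT queries for other tables that use the same subquery. But I tried getting rid of the CTE and the result was the same: the first time I execute the query it runs... runs... and runs for more than ten minutes. The second time I execute it, it takes only 4 seconds, which is acceptable to me.

The result of EXPLAIN ANALYZE: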

Aggregate  (cost=46194894.49..46194894.50 rows=1 width=0) (actual time=127728.452..127728.452 rows=1 loops=1)
  CTE filtered_title
    ->  Seq Scan on title top  (cost=0.00..46123542.41 rows=1430406 width=4) (actual time=732.509..1596.345 rows=16250 loops=1)
          Filter: (((production_year >= 1982) AND (production_year <= 1984) AND (kind_id = ANY ('{1,2}'::integer[]))) OR (alternatives: SubPlan 1 or hashed SubPlan 2))
          Rows Removed by Filter: 2832906
          SubPlan 1
            ->  Index Scan using title_idx_epof on title sub  (cost=0.43..16.16 rows=1 width=0) (never executed)
                  Index Cond: (episode_of_id = top.id)
                  Filter: ((production_year >= 1982) AND (production_year <= 1984) AND (kind_id = ANY ('{1,2}'::integer[])))
          SubPlan 2
            ->  Seq Scan on title sub_1  (cost=0.00..90471.23 rows=11657 width=4) (actual time=0.071..730.311 rows=16250 loops=1)
                  Filter: ((production_year >= 1982) AND (production_year <= 1984) AND (kind_id = ANY ('{1,2}'::integer[])))
                  Rows Removed by Filter: 2832906
  ->  Nested Loop  (cost=32184.70..63158.16 rows=3277568 width=0) (actual time=1620.382..127719.030 rows=29679 loops=1)
        ->  HashAggregate  (cost=32184.13..32186.13 rows=200 width=4) (actual time=1620.058..1631.697 rows=16250 loops=1)
              ->  CTE Scan on filtered_title  (cost=0.00..28608.12 rows=1430406 width=4) (actual time=732.513..1607.093 rows=16250 loops=1)
        ->  Index Scan using cast_info_idx_mid on cast_info  (cost=0.56..154.80 rows=6 width=4) (actual time=5.977..7.758 rows=2 loops=16250)
              Index Cond: (movie_id = filtered_title.id)
              Filter: (role_id = ANY ('{3,8}'::integer[]))
              Rows Removed by Filter: 15
Total runtime: 127729.100 ms

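Now to my question. What am I doing wrong, and how can I fix it?

I tried a few variants of the same query: joins only, and joins combined with EXISTS. On the one hand, the variant below seems to need the least time to do the job (10 times faster), but it still averages 60 seconds. On the other hand, unlike my first query, which needs 4-6 seconds on the second run, it always takes 60 seconds.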

WITH filtered_title 
     AS (SELECT top.id 
         FROM   title top 
         WHERE  top.production_year >= 1982 
                AND top.production_year <= 1984 
                AND top.kind_id IN( 1, 2 ) 
                 OR EXISTS(SELECT 1 
                           FROM   title sub 
                           WHERE  sub.episode_of_id = top.id 
                                  AND sub.production_year >= 1982 
                                  AND sub.production_year <= 1984 
                                  AND sub.kind_id IN( 1, 2 ))) 
SELECT Count(*) 
FROM   cast_info 
       join filtered_title 
         ON cast_info.movie_id = filtered_title.id 
WHERE  cast_info.role_id IN( 3, 8 ) 

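1 answer:

Answer 0 (score: 4)

Disclaimer: there are too many factors in play to give a conclusive answer. The information "with a few tables, each has a few millions rows (tables do have indexes)" just doesn't cut it. It depends on cardinalities, table definitions, data types, usage patterns and (probably most important) indexes. And, of course, a proper basic configuration of your db server. All of this goes beyond the scope of a single question on SO. Start with the links in the question's tags. Or hire a professional.

I am going to address the most prominent detail (for me) in your query plan:

Sequential scan on title?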

  

->  Seq Scan on title sub_1  (cost=0.00..90471.23 rows=11657 width=4) (actual time=0.071..730.311 rows=16250 loops=1)
        Filter: ((production_year >= 1982) AND (production_year <= 1984) AND (kind_id = ANY ('{1,2}'::integer[])))
        Rows Removed by Filter: 2832906

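The points to notice are the Seq Scan and rows=16250. Sequentially scanning almost 3 million rows to retrieve just 16250 is not very efficient. The sequential scan is also the likely reason why the first run takes so much longer: subsequent calls can read the data from cache. Since the table is big, the data probably won't stay in cache for long unless you have a huge amount of cache.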
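One way to check the caching explanation is to ask EXPLAIN for buffer statistics. This is only a sketch: it assumes a Postgres version that supports the BUFFERS option (9.0 or later) and reuses the filter from the query above on its own.

```sql
-- Run this twice and compare the "Buffers:" lines in the output.
-- "shared hit" counts pages found in Postgres' shared buffers,
-- "read" counts pages that had to be requested from the operating system.
EXPLAIN (ANALYZE, BUFFERS)
SELECT count(*)
FROM   title
WHERE  kind_id IN (1, 2)
AND    production_year BETWEEN 1982 AND 1984;
```

If the first run shows mostly "read" and the second mostly "shared hit", the gap between the runs is largely a cold-cache effect.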

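An index scan is typically much faster for gathering about 0.5% of the rows from a big table. Several causes are possible.

My money is on the index. You did not state your Postgres version, so I am assuming the current 9.3. The perfect index for this query would be: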

CREATE INDEX title_foo_idx ON title (kind_id, production_year, id, episode_of_id)

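Data types matter. The order of columns in the index matters.
kind_id comes first, because the rule of thumb is: index for equality first, then for ranges. The last two columns (id, episode_of_id) are only useful for a potential index-only scan. If that is not applicable, drop them. More details: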
PostgreSQL composite primary key
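If an index-only scan is not applicable, the reduced index could look like the sketch below. The index name title_kind_year_idx is made up for illustration, and the EXPLAIN at the end is just one way to see whether the planner picks the index up.

```sql
-- Hypothetical index name; covers only the two filter columns.
CREATE INDEX title_kind_year_idx ON title (kind_id, production_year);

-- Refresh the planner statistics for the table.
ANALYZE title;

-- Check the plan: look for an Index Scan or Bitmap Index Scan
-- on title instead of a Seq Scan.
EXPLAIN
SELECT id, episode_of_id
FROM   title
WHERE  kind_id BETWEEN 1 AND 2
AND    production_year BETWEEN 1982 AND 1984;
```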

The way you built your query, you end up with two sequential scans on the big table. So here is an educated guess at a ...

Better query

WITH t_base AS (
   SELECT id, episode_of_id
   FROM   title
   WHERE  kind_id BETWEEN 1 AND 2
   AND    production_year BETWEEN 1982 AND 1984 
   )
, t_all AS (
   SELECT id FROM t_base

   UNION  -- not UNION ALL (!)
   SELECT id
   FROM  (SELECT DISTINCT episode_of_id AS id FROM t_base) x
   JOIN   title t USING (id)
   )
SELECT count(*) AS ct
FROM   cast_info c
JOIN   t_all t ON t.id = c.movie_id 
WHERE  c.role_id IN (3, 8);

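This should give you one index scan on the new title_foo_idx and another index scan on the pk index of title. The rest should be relatively cheap. With any luck, it is much faster than before.

  • kind_id BETWEEN 1 AND 2 .. As long as you have a continuous range of values, this is faster than listing individual values, because Postgres can then fetch a continuous range from the index. With just two values it does not matter much.

  • Test this alternative for the second leg of t_all. Not sure which is faster: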

       SELECT id
       FROM   title t 
       WHERE  EXISTS (SELECT 1 FROM t_base WHERE t_base.episode_of_id = t.id)
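To compare the two variants, one option is to run each complete query under EXPLAIN ANALYZE and look at the reported runtimes. The sketch below simply plugs the EXISTS variant into the query from above. Run each version more than once so that caching does not skew the comparison.

```sql
EXPLAIN ANALYZE
WITH t_base AS (
   SELECT id, episode_of_id
   FROM   title
   WHERE  kind_id BETWEEN 1 AND 2
   AND    production_year BETWEEN 1982 AND 1984
   )
, t_all AS (
   SELECT id FROM t_base
   UNION  -- not UNION ALL, duplicates must be folded
   SELECT id
   FROM   title t
   WHERE  EXISTS (SELECT 1 FROM t_base WHERE t_base.episode_of_id = t.id)
   )
SELECT count(*) AS ct
FROM   cast_info c
JOIN   t_all t ON t.id = c.movie_id
WHERE  c.role_id IN (3, 8);
```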
    

Temporary table instead of CTE

You wrote:

  

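I use a CTE because there are more COUNT queries for other tables that use the same subquery.

CTEs act as optimization barriers, and the resulting internal work table is not indexed. When a result (with more than a trivial number of rows) is reused multiple times, it pays to use an indexed temp table instead. Creating an index on a simple int column is fast.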

CREATE TEMP TABLE t_tmp AS
WITH t_base AS (
   SELECT id, episode_of_id
   FROM   title
   WHERE  kind_id BETWEEN 1 AND 2
   AND    production_year BETWEEN 1982 AND 1984 
   )
SELECT id FROM t_base
UNION
SELECT id FROM title t 
WHERE  EXISTS (SELECT 1 FROM t_base WHERE t_base.episode_of_id = t.id);

ANALYZE t_tmp;                       -- !
CREATE UNIQUE INDEX ON t_tmp (id);   -- ! (unique is optional)

SELECT count(*) AS ct
FROM   cast_info c
JOIN   t_tmp t ON t.id = c.movie_id 
WHERE  c.role_id IN (3, 8);

-- More queries using t_tmp
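As an illustration of such a follow-up query, here is a sketch that reuses t_tmp for a per-role breakdown. It only uses tables and columns that appear above; the grouping itself is my example, not something the question asks for.

```sql
-- Another count that reuses t_tmp: rows per role instead of one total.
SELECT c.role_id, count(*) AS ct
FROM   cast_info c
JOIN   t_tmp t ON t.id = c.movie_id
WHERE  c.role_id IN (3, 8)
GROUP  BY c.role_id;

-- Optional cleanup; otherwise the temp table is dropped when the session ends.
DROP TABLE IF EXISTS t_tmp;
```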

About temp tables:
How to tell if record has changed in Postgres