Question

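I have a database with a few tables, each of which has a few million rows (the tables do have indexes). I need to count rows in a table, but only those whose foreign key field points to a subset of another table. Here is the query: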

WITH filtered_title 
     AS (SELECT top.id 
         FROM   title top 
         WHERE  ( top.production_year >= 1982 
                  AND top.production_year <= 1984 
                  AND top.kind_id IN( 1, 2 ) 
                   OR EXISTS(SELECT 1 
                             FROM   title sub 
                             WHERE  sub.episode_of_id = top.id 
                                    AND sub.production_year >= 1982 
                                    AND sub.production_year <= 1984 
                                    AND sub.kind_id IN( 1, 2 )) )) 
SELECT Count(*) 
FROM   cast_info 
WHERE  EXISTS(SELECT 1 
              FROM   filtered_title 
              WHERE  cast_info.movie_id = filtered_title.id) 
       AND cast_info.role_id IN( 3, 8 )

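I use a CTE because there are more COUNT queries further down for other tables, which use the same subquery. But I tried getting rid of the CTE and the result was the same: the first time I execute the query it runs... runs... and runs for more than ten minutes. The second time I execute the query, it is down to 4 seconds, which is acceptable for me.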

The result of EXPLAIN ANALYZE:

Aggregate  (cost=46194894.49..46194894.50 rows=1 width=0) (actual time=127728.452..127728.452 rows=1 loops=1)
  CTE filtered_title
    ->  Seq Scan on title top  (cost=0.00..46123542.41 rows=1430406 width=4) (actual time=732.509..1596.345 rows=16250 loops=1)
          Filter: (((production_year >= 1982) AND (production_year <= 1984) AND (kind_id = ANY ('{1,2}'::integer[]))) OR (alternatives: SubPlan 1 or hashed SubPlan 2))
          Rows Removed by Filter: 2832906
          SubPlan 1
            ->  Index Scan using title_idx_epof on title sub  (cost=0.43..16.16 rows=1 width=0) (never executed)
                  Index Cond: (episode_of_id = top.id)
                  Filter: ((production_year >= 1982) AND (production_year <= 1984) AND (kind_id = ANY ('{1,2}'::integer[])))
          SubPlan 2
            ->  Seq Scan on title sub_1  (cost=0.00..90471.23 rows=11657 width=4) (actual time=0.071..730.311 rows=16250 loops=1)
                  Filter: ((production_year >= 1982) AND (production_year <= 1984) AND (kind_id = ANY ('{1,2}'::integer[])))
                  Rows Removed by Filter: 2832906
  ->  Nested Loop  (cost=32184.70..63158.16 rows=3277568 width=0) (actual time=1620.382..127719.030 rows=29679 loops=1)
        ->  HashAggregate  (cost=32184.13..32186.13 rows=200 width=4) (actual time=1620.058..1631.697 rows=16250 loops=1)
              ->  CTE Scan on filtered_title  (cost=0.00..28608.12 rows=1430406 width=4) (actual time=732.513..1607.093 rows=16250 loops=1)
        ->  Index Scan using cast_info_idx_mid on cast_info  (cost=0.56..154.80 rows=6 width=4) (actual time=5.977..7.758 rows=2 loops=16250)
              Index Cond: (movie_id = filtered_title.id)
              Filter: (role_id = ANY ('{3,8}'::integer[]))
              Rows Removed by Filter: 15
Total runtime: 127729.100 ms

Now to my question. What am I doing wrong, and how can I fix it?

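I tried a few variants of the same query: exclusive joins, join/exists. On the one hand, the variant below seems to need the least time to do the job (10x faster), but it still takes 60 seconds on average. On the other hand, unlike my first query, which needs 4-6 seconds on the second run, it always takes 60 seconds.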

WITH filtered_title 
     AS (SELECT top.id 
         FROM   title top 
         WHERE  top.production_year >= 1982 
                AND top.production_year <= 1984 
                AND top.kind_id IN( 1, 2 ) 
                 OR EXISTS(SELECT 1 
                           FROM   title sub 
                           WHERE  sub.episode_of_id = top.id 
                                  AND sub.production_year >= 1982 
                                  AND sub.production_year <= 1984 
                                  AND sub.kind_id IN( 1, 2 ))) 
SELECT Count(*) 
FROM   cast_info 
       join filtered_title 
         ON cast_info.movie_id = filtered_title.id 
WHERE  cast_info.role_id IN( 3, 8 )

Answer 1

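Disclaimer: There are far too many factors in play to give a conclusive answer. The information "with a few tables, each has a few millions rows (tables do have indexes)" just doesn't cut it. It depends on cardinalities, table definitions, data types, usage patterns and (probably most important) indexes. And on a proper basic configuration of your db server, of course. All of this goes beyond the scope of a single question on SO. Start with the links in the postgresql-performance tag. Or hire a professional.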

I am going to address the most prominent detail (for me) in your query plan:

Sequential scan on `title`?

->  Seq Scan on title sub_1  (cost=0.00..90471.23 rows=11657 width=4) (actual time=0.071..730.311 rows=16250 loops=1)
      Filter: ((production_year >= 1982) AND (production_year <= 1984) AND (kind_id = ANY ('{1,2}'::integer[])))
      Rows Removed by Filter: 2832906

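Sequentially scanning almost 3 million rows to retrieve just 16250 is not very efficient. The sequential scan is also the likely reason why the first run takes so much longer. Subsequent calls can read the data from cache. Since the table is big, the data probably won't stay in cache for long unless you have huge amounts of cache.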
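One way to check the caching explanation is to look at buffer statistics. The following is a minimal sketch, assuming the same filter as in the plan above; the query itself is only an illustration, not part of the original question.

```sql
-- Run this twice in the same session and compare the "Buffers:" lines.
-- "shared hit"  = pages already found in Postgres shared buffers
-- "shared read" = pages that had to be fetched from outside shared buffers
--                 (OS cache or disk)
EXPLAIN (ANALYZE, BUFFERS)
SELECT count(*)
FROM   title
WHERE  production_year BETWEEN 1982 AND 1984
AND    kind_id IN (1, 2);
```

If the first run shows mostly reads and the second mostly hits, that would be consistent with a cold cache explaining the slow first execution.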

An index scan is typically substantially faster for gathering 0.5 % of the rows from a big table. Possible causes:

Statistics are off.
Cost settings are off.
No matching index.
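Each of the three possible causes can be checked from a psql session. This is a rough sketch using standard catalog views and settings; what counts as a "good" value depends on your hardware and data, so treat the output as a starting point only.

```sql
-- 1. Statistics: when were the tables last analyzed? Refresh if in doubt.
SELECT relname, last_analyze, last_autoanalyze
FROM   pg_stat_user_tables
WHERE  relname IN ('title', 'cast_info');

ANALYZE title;
ANALYZE cast_info;

-- 2. Cost settings: inspect the planner parameters currently in effect.
SHOW random_page_cost;
SHOW seq_page_cost;
SHOW effective_cache_size;

-- 3. Indexes: list what actually exists on the big table.
SELECT indexname, indexdef
FROM   pg_indexes
WHERE  tablename = 'title';
```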

My money is on the index. You didn't provide your Postgres version, so I am assuming the current 9.3. The perfect index for this query would be:

CREATE INDEX title_foo_idx ON title (kind_id, production_year, id, episode_of_id)

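Data types matter. The order of columns in the index matters.
kind_id comes first because the rule of thumb is: index for equality first, then for ranges. The last two columns (id, episode_of_id) are only useful for a potential index-only scan. If that is not applicable, drop them. More details in the related question:
PostgreSQL composite primary key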
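If index-only scans do not apply, the leaner variant mentioned above might look like the sketch below. The index name title_kind_year_idx is made up for illustration; the EXPLAIN afterwards is simply a way to see whether the planner picks the new index.

```sql
-- Leaner variant without the two trailing columns
-- (hypothetical index name, chosen for this example only)
CREATE INDEX title_kind_year_idx ON title (kind_id, production_year);

-- Check which access path the planner chooses for the filter
EXPLAIN
SELECT id, episode_of_id
FROM   title
WHERE  kind_id IN (1, 2)
AND    production_year BETWEEN 1982 AND 1984;
```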

The way you built your query, you end up with two sequential scans on the big table. So here is an educated guess at a ...

Better query

WITH t_base AS (
   SELECT id, episode_of_id
   FROM   title
   WHERE  kind_id BETWEEN 1 AND 2
   AND    production_year BETWEEN 1982 AND 1984
   )
, t_all AS (
   SELECT id FROM t_base
   UNION  -- not UNION ALL (!)
   SELECT id
   FROM  (SELECT DISTINCT episode_of_id AS id FROM t_base) x
   JOIN   title t USING (id)
   )
SELECT count(*) AS ct
FROM   cast_info c
JOIN   t_all t ON t.id = c.movie_id
WHERE  c.role_id IN (3, 8);

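This should give you one index scan on the new title_foo_idx, plus another index scan on the pk index of title. The rest should be relatively cheap. With any luck, much faster than before.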

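kind_id BETWEEN 1 AND 2 .. as long as you have a continuous range of values, this is faster than listing individual values, because this way Postgres can fetch a continuous range from the index. It is not very important for just two values.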

Test this alternative for the second leg of t_all. I am not sure which is faster:

SELECT id FROM title t WHERE EXISTS (SELECT 1 FROM t_base WHERE t_base.episode_of_id = t.id)

Temporary table instead of CTE

You write:

I use a CTE because there are more COUNT queries further down for other tables, which use the same subquery.

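A CTE poses as an optimization barrier, and the resulting internal work table is not indexed. When a result (with more than a trivial number of rows) is reused multiple times, it pays to use an indexed temp table instead. Creating an index on a simple int column is fast.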

CREATE TEMP TABLE t_tmp AS
WITH t_base AS (
   SELECT id, episode_of_id
   FROM   title
   WHERE  kind_id BETWEEN 1 AND 2
   AND    production_year BETWEEN 1982 AND 1984
   )
SELECT id FROM t_base
UNION
SELECT id FROM title t
WHERE  EXISTS (SELECT 1 FROM t_base WHERE t_base.episode_of_id = t.id);

ANALYZE t_tmp;                       -- !
CREATE UNIQUE INDEX ON t_tmp (id);   -- ! (unique is optional)

SELECT count(*) AS ct
FROM   cast_info c
JOIN   t_tmp t ON t.id = c.movie_id
WHERE  c.role_id IN (3, 8);

-- More queries using t_tmp

More about temp tables in the related question:
How to tell if record has changed in Postgres
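A temp table lives until the end of the session unless told otherwise. If all the COUNT queries run in one transaction, a possible variation is to let Postgres drop the table at commit. The sketch below is an assumption about how you might structure that; the SELECT that fills t_tmp is a simplified stand-in for the full query shown earlier.

```sql
BEGIN;

-- ON COMMIT DROP removes the temp table automatically at COMMIT
CREATE TEMP TABLE t_tmp ON COMMIT DROP AS
SELECT id
FROM   title
WHERE  kind_id BETWEEN 1 AND 2
AND    production_year BETWEEN 1982 AND 1984;  -- simplified stand-in

CREATE INDEX ON t_tmp (id);
ANALYZE t_tmp;

SELECT count(*) AS ct
FROM   cast_info c
JOIN   t_tmp t ON t.id = c.movie_id
WHERE  c.role_id IN (3, 8);

-- ... further COUNT queries against t_tmp go here ...

COMMIT;  -- t_tmp is gone after this point
```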
