Question

我有一个~2TB完全抽真空的Redshift表，带有distkey (phash, last_seen)（高基数，数亿个值）和复合排序键SELECT DISTINCT ret_field FROM table WHERE phash IN ( '5c8615fa967576019f846b55f11b6e41', '8719c8caa9740bec10f914fc2434ccfd', '9b657c9f6bf7c5bbd04b5baf94e61dae' ) AND last_seen BETWEEN '2015-10-01 00:00:00' AND '2015-10-31 23:59:59'。

当我进行如下查询时：

phash

它很快就会回归。但是，当我将哈希数增加到10以上时，Redshift会将IN条件从一堆OR转换为数组，每http://docs.aws.amazon.com/redshift/latest/dg/r_in_condition.html#r_in_condition-optimization-for-large-in-lists

问题是当我有几十个EXPLAIN值时，“优化”查询从不到一秒的响应时间变为超过半小时。换句话说，它停止使用sortkey并进行全表扫描。

知道如何防止这种行为并保留使用sortkeys来保持查询的快速性吗？

以下是＆lt; 10哈希和＆gt; 10哈希之间的XN Unique (cost=0.00..157253450.20 rows=43 width=27) -> XN Seq Scan on table (cost=0.00..157253393.92 rows=22510 width=27) Filter: ((((phash)::text = '394e9a527f93377912cbdcf6789787f1'::text) OR ((phash)::text = '4534f9f8f68cc937f66b50760790c795'::text) OR ((phash)::text = '5c8615fa967576019f846b55f11b6e61'::text) OR ((phash)::text = '5d5743a86b5ff3d60b133c6475e7dce0'::text) OR ((phash)::text = '8719c8caa9740bec10f914fc2434cced'::text) OR ((phash)::text = '9b657c9f6bf7c5bbd04b5baf94e61d9e'::text) OR ((phash)::text = 'd7337d324be519abf6dbfd3612aad0c0'::text) OR ((phash)::text = 'ea43b04ac2f84710dd1f775efcd5ab40'::text)) AND (last_seen >= '2015-10-01 00:00:00'::timestamp without time zone) AND (last_seen <= '2015-10-31 23:59:59'::timestamp without time zone))差异：

少于10（0.4秒）：

XN Unique  (cost=0.00..181985241.25 rows=1717530 width=27)
    ->  XN Seq Scan on table  (cost=0.00..179718164.48 rows=906830708 width=27)
                Filter: ((last_seen >= '2015-10-01 00:00:00'::timestamp without time zone) AND (last_seen <= '2015-10-31 23:59:59'::timestamp without time zone) AND ((phash)::text = ANY ('{33b84c5775b6862df965a0e00478840e,394e9a527f93377912cbdcf6789787f1,3d27b96948b6905ffae503d48d75f3d1,4534f9f8f68cc937f66b50760790c795,5a63cd6686f7c7ed07a614e245da60c2,5c8615fa967576019f846b55f11b6e61,5d5743a86b5ff3d60b133c6475e7dce0,8719c8caa9740bec10f914fc2434cced,9b657c9f6bf7c5bbd04b5baf94e61d9e,d7337d324be519abf6dbfd3612aad0c0,dbf4c743832c72e9c8c3cc3b17bfae5f,ea43b04ac2f84710dd1f775efcd5ab40,fb4b83121cad6d23e6da6c7b14d2724c}'::text[])))

超过10（45-60分钟）：

h = msgbox(' 3');
set(h, 'position', [500 300 100 100]); %makes box bigger
ah = get( h, 'CurrentAxes' );
ch = get( ah, 'Children' );
set( ch, 'FontSize', 70 ); %makes text bigger
pause(1)
h = msgbox(' 2');
set(h, 'position', [500 300 100 100]); %makes box bigger
ah = get( h, 'CurrentAxes' );
ch = get( ah, 'Children' );
set( ch, 'FontSize', 70 ); %makes text bigger
pause(1)
h = msgbox(' 1');
set(h, 'position', [500 300 100 100]); %makes box bigger
ah = get( h, 'CurrentAxes' );
ch = get( ah, 'Children' );
set( ch, 'FontSize', 70 ); %makes text bigger
pause(1)

Answer 1

您可以尝试创建临时表/子查询：

SELECT DISTINCT t.ret_field
FROM table t
JOIN (
   SELECT '5c8615fa967576019f846b55f11b6e41' AS phash
   UNION ALL 
   SELECT '8719c8caa9740bec10f914fc2434ccfd' AS phash
   UNION ALL
   SELECT '9b657c9f6bf7c5bbd04b5baf94e61dae' AS phash
   -- UNION ALL
) AS sub
   ON t.phash = sub.phash
WHERE t.last_seen BETWEEN '2015-10-01 00:00:00' AND '2015-10-31 23:59:59';

或者在块中搜索（如果查询优化器将其合并为一个，使用辅助表来存储中间结果）：

SELECT ret_field
FROM table
WHERE phash IN (
        '5c8615fa967576019f846b55f11b6e41',
        '8719c8caa9740bec10f914fc2434ccfd',
        '9b657c9f6bf7c5bbd04b5baf94e61dae')
  AND last_seen BETWEEN '2015-10-01 00:00:00' AND '2015-10-31 23:59:59'
UNION
SELECT ret_field
FROM table
WHERE phash IN ( ) -- more hashes)
  AND last_seen BETWEEN '2015-10-01 00:00:00' AND '2015-10-31 23:59:59'
UNION 
-- ...

如果查询优化器将其合并为一个，您可以尝试将临时表用于中间结果

修改

SELECT DISTINCT t.ret_field FROM table t JOIN (SELECT ... AS phash FROM ... ) AS sub ON t.phash = sub.phash WHERE t.last_seen BETWEEN '2015-10-01 00:00:00' AND '2015-10-31 23:59:59';

Answer 2

你真的需要DISTINCT吗？这个运营商可能很贵。

我尝试使用LATERAL JOIN。在下面的查询中，表格Hashes有一列phash - 这是您的大批哈希值。它可以是临时表，（子）查询，任何东西。

SELECT DISTINCT T.ret_field
FROM
    Hashes
    INNER JOIN LATERAL
    (
        SELECT table.ret_field
        FROM table
        WHERE
            table.phash = Hashes.phash
            AND table.last_seen BETWEEN '2015-10-01 00:00:00' AND '2015-10-31 23:59:59'
    ) AS T ON true

优化器很可能将LATERAL JOIN实现为嵌套循环。它将循环遍历Hashes中的所有行，并且每行都运行SELECT FROM table。内部SELECT应使用(phash, last_seen)上的索引。为了安全起见，还要将ret_field包含在索引中，以使其成为覆盖索引：(phash, last_seen, ret_field)。

@Diego在答案中有一个非常有效的观点：不是将常量phash值放入查询中，而是将它们放在临时表或永久表中。

我想扩展@Diego的答案，并补充说这个带哈希的表具有索引，唯一索引是很重要的。

因此，创建一个表Hashes，其中一列phash与您的主table.phash中的列完全相同。类型匹配很重要。使该列成为具有唯一聚簇索引的主键。将您的几十个phash值转储到Hashes表中。

然后查询变为简单的INNER JOIN，而不是横向：

SELECT DISTINCT T.ret_field
FROM
    Hashes
    INNER JOIN table ON table.phash = Hashes.phash
WHERE
    table.last_seen BETWEEN '2015-10-01 00:00:00' AND '2015-10-31 23:59:59'

table (phash, last_seen, ret_field)上的索引仍然很重要。

优化器应该能够利用这两个连接表按phash列排序并且在Hashes表中唯一的事实。

Answer 3

值得尝试设置sortkeys (last_seen, phash)，先放置last_seen。

缓慢的原因可能是因为排序键的前导列是phash，它看起来像一个随机字符。正如AWS redshift开发文档所述，如果将条件列用于where条件，则timestamp列应作为排序键的前导列。

如果最常查询最近的数据，请指定时间戳列作为排序键的前导列。 - Choose the Best Sort Key - Amazon Redshift

使用此排序键的顺序，所有列将按last_seen排序，然后排序phash。（What does it mean to have multiple sortkey columns?）

需要注意的是，您必须重新创建表以更改排序键。 This会帮助您做到这一点。

Answer 4

你可以摆脱＆＃34; ORs＆＃34;将所需数据插入临时表并将其与实际表格连接。

这是一个例子（我使用CTE，因为当你有多个SQL语句时，我使用的工具难以捕获计划 - 但如果可以，请使用临时表）

select * 
from <my_table>
where checksum in 
(
'd7360f1b600ae9e895e8b38262cee47936fb6ced',
'd1606f795152c73558513909cd59a8bc3ad865a8',
'bb3f6bb3d1a98d35a0f952a53d738ddec5c72c84',
'b2cad5a92575ed3868ac6e405647c2213eea74a5'
)

VERSUS

with foo as
(
    select 'd7360f1b600ae9e895e8b38262cee47936fb6ced' as my_key union
    select 'd1606f795152c73558513909cd59a8bc3ad865a8' union
    select 'bb3f6bb3d1a98d35a0f952a53d738ddec5c72c84' union
    select 'b2cad5a92575ed3868ac6e405647c2213eea74a5'
)
select  * 
from <my_table> r 
     join foo f on r.checksum = F.my_key

这是计划，因为你可以看到它看起来更复杂，但由于CTE的原因，它不会在临时表上看起来那样：

Answer 5

您是否尝试将union用于所有phash值？

就像那样：

SELECT ret_field 
FROM   table 
WHERE  phash = '5c8615fa967576019f846b55f11b6e41' -- 1st phash value
and    last_seen BETWEEN '2015-10-01 00:00:00' AND '2015-10-31 23:59:59'

UNION 

SELECT ret_field 
FROM   table 
WHERE  phash = '8719c8caa9740bec10f914fc2434ccfd' -- 2nd phash value
and    last_seen BETWEEN '2015-10-01 00:00:00' AND '2015-10-31 23:59:59'

UNION 

SELECT ret_field 
FROM   table 
WHERE  phash = '9b657c9f6bf7c5bbd04b5baf94e61dae' -- 3rd phash value
and    last_seen BETWEEN '2015-10-01 00:00:00' AND '2015-10-31 23:59:59'

-- and so on...

UNION 

SELECT ret_field 
FROM   table 
WHERE  phash = 'nnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnn' -- Nth phash value
and    last_seen BETWEEN '2015-10-01 00:00:00' AND '2015-10-31 23:59:59'

为Redshift查询优化大IN条件

5 个答案: