Question

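I noticed that the query below runs slowly, and after looking at it in detail I am wondering why Redshift first scans both tables (Events and Contact) and only then joins them. The Contact table has more than 300,000 rows. My expectation was that Redshift would first scan the large Events table using the filters specified for it, and then look up the matching contacts by the Contact_ID column. Is my expectation incorrect? Is there anything else I can do to speed up the query? I have run VACUUM and ANALYZE on all tables.

Query: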

select c.Segment
, Count (Distinct (CASE WHEN et.Event_ID = 1 THEN et.Contact_ID ELSE null END)) as L1
, Count (Distinct (CASE WHEN et.Event_ID = 2 THEN et.Contact_ID ELSE null END)) as L2
from
Events et 
join contact c on c.Account_ID = et.Account_ID and c.ID = et.Contact_ID
where
et.Account_ID = 5
and et.Event_ID in (1, 2)
and et.IsGuest = 0
and et.dim_date_id >=20151125 
and et.dim_date_id <=20160226
group by c.Segment
order by 1

Explain plan:

XN Merge (cost=1000000074927.82..1000000074927.83 rows=1 width=20)
  -> XN Network (cost=1000000074927.82..1000000074927.83 rows=1 width=20)
    -> XN Sort (cost=1000000074927.82..1000000074927.83 rows=1 width=20)
      -> XN HashAggregate (cost=74927.80..74927.81 rows=1 width=20)
        -> XN Merge Join DS_DIST_NONE (cost=0.00..74927.57 rows=31 width=20)
          -> XN Seq Scan on contact c (cost=0.00..497.56 rows=39805 width=16)
          -> XN Seq Scan on eventtransaction et (cost=0.00..6664.84 rows=136 width=20)

Answer 1

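The filters are applied only after the join has been performed. If you want the join to happen after the filters are applied, I suggest creating a temporary table (written here as a filtered subquery) and joining it with the contact table, as shown in the code below.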

select c.Segment
, Count (Distinct (CASE WHEN et.Event_ID = 1 THEN et.Contact_ID ELSE null END)) as L1
, Count (Distinct (CASE WHEN et.Event_ID = 2 THEN et.Contact_ID ELSE null END)) as L2
from
(
  -- filter the large Events table first; the alias et only exists outside this subquery
  select Event_ID, Account_ID, Contact_ID
  from Events
  where
    Account_ID = 5
    and Event_ID in (1, 2)
    and IsGuest = 0
    and dim_date_id >= 20151125
    and dim_date_id <= 20160226
) et
join contact c on c.Account_ID = et.Account_ID and c.ID = et.Contact_ID
group by c.Segment
order by 1
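
One way to check whether the rewrite actually changes anything is to prefix it with EXPLAIN and compare the output with the plan in the question. This is only a sketch that reuses the table and column names from the question; the planner may well produce the same plan for both forms.

```sql
-- Show the plan for the filtered-subquery form without executing the query.
explain
select c.Segment
, Count (Distinct (CASE WHEN et.Event_ID = 1 THEN et.Contact_ID ELSE null END)) as L1
, Count (Distinct (CASE WHEN et.Event_ID = 2 THEN et.Contact_ID ELSE null END)) as L2
from
(
  select Event_ID, Account_ID, Contact_ID
  from Events
  where
    Account_ID = 5
    and Event_ID in (1, 2)
    and IsGuest = 0
    and dim_date_id >= 20151125
    and dim_date_id <= 20160226
) et
join contact c on c.Account_ID = et.Account_ID and c.ID = et.Contact_ID
group by c.Segment
order by 1;
```

If the join type and the estimated row counts on the two scans are unchanged, the rewrite has not altered how the filters are applied.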

Additionally, if you set a sort key on dim_date_id, you should see this query speed up. More details on this can be found here
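
As a rough illustration of the sort key suggestion, the table could be rebuilt with a deep copy that declares dim_date_id as the sort key. The name events_by_date is hypothetical, and the sketch assumes the source table is called Events as in the question. The copy does not carry over the original distribution style, and the merge join in the plan suggests the current distribution and sort keys match the join columns, so compare plans before switching tables.

```sql
-- Deep copy of Events, sorted on dim_date_id so that range filters on the
-- date column can skip blocks outside the requested range.
-- events_by_date is a hypothetical name used only for this sketch.
create table events_by_date
sortkey (dim_date_id)
as
select * from Events;

-- Refresh planner statistics for the new table.
analyze events_by_date;
```

The query would then read from events_by_date instead of Events.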
