Why does a separate table perform significantly better than a subquery?

Asked: 2016-11-01 21:14:54

Tags: sql performance teradata

I am trying to improve the performance of a SQL query and have tried several combinations.

Original query

SELECT ALIAS_A.id1, 
       ALIAS_A.id2, 
       ALIAS_B.columnA, 
       ALIAS_C.columnB, 
       ALIAS_B.columnC 
FROM   db_A.table_A ALIAS_A 
       LEFT OUTER JOIN db_A.table_B ALIAS_B 
                    ON ALIAS_A.id2 = ALIAS_B.id2 
       LEFT OUTER JOIN db_B.table_C ALIAS_C 
                    ON ALIAS_B.columnA = ALIAS_C.item_num 
       LEFT OUTER JOIN db_A.table_D ALIAS_D 
                    ON ALIAS_A.id2 = ALIAS_D.id2 
       INNER JOIN db_C.table_E ALIAS_E 
               ON Cast(ALIAS_A.column_date AS DATE) BETWEEN 
                  ALIAS_E.column_startdate AND ALIAS_E.column_enddate 
WHERE  ALIAS_E.fiscalyear >= 2016 
       AND Cast(ALIAS_A.columnD AS DATE) BETWEEN 
           CURRENT_DATE - 5 AND CURRENT_DATE 

The query above consumed nearly 400k impactCPU.

Optimized query 1

SELECT New_sub_table.id1, 
       New_sub_table.id2, 
       ALIAS_B.columnA, 
       ALIAS_C.columnB, 
       ALIAS_B.columnC 
--changed part start--
FROM   (SELECT * 
        FROM   db_A.table_A ALIAS_A 
        WHERE  Cast(ALIAS_A.columnD AS DATE) BETWEEN 
               CURRENT_DATE - 5 AND CURRENT_DATE) New_sub_table -- created a subquery 
--changed part end--
       LEFT OUTER JOIN db_A.table_B ALIAS_B 
                    ON New_sub_table.id2 = ALIAS_B.id2 
       LEFT OUTER JOIN db_B.table_C ALIAS_C 
                    ON ALIAS_B.columnA = ALIAS_C.item_num 
       LEFT OUTER JOIN db_A.table_D ALIAS_D 
                    ON New_sub_table.id2 = ALIAS_D.id2 
       INNER JOIN db_C.table_E ALIAS_E 
               ON Cast(New_sub_table.column_date AS DATE) BETWEEN 
                  ALIAS_E.column_startdate AND ALIAS_E.column_enddate 
WHERE  ALIAS_E.fiscalyear >= 2016 

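My idea was to filter the data first and only then do the joins. When I checked the performance statistics afterwards, the query had consumed nearly 390k impactCPU, so there was hardly any difference.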

Optimized query 2

SELECT ALIAS_A.id1, 
       ALIAS_A.id2, 
       ALIAS_B.columnA, 
       ALIAS_C.columnB, 
       ALIAS_B.columnC 
--changed part start--
FROM   INTERMEDIATE_DB.INTERMEDIATE_TABLE ALIAS_A --CREATED AN INTERMEDIATE TABLE
--changed part end--
       LEFT OUTER JOIN db_A.table_B ALIAS_B 
                    ON ALIAS_A.id2 = ALIAS_B.id2 
       LEFT OUTER JOIN db_B.table_C ALIAS_C 
                    ON ALIAS_B.columnA = ALIAS_C.item_num 
       LEFT OUTER JOIN db_A.table_D ALIAS_D 
                    ON ALIAS_A.id2 = ALIAS_D.id2 
       INNER JOIN db_C.table_E ALIAS_E 
               ON Cast(ALIAS_A.column_date AS DATE) BETWEEN 
                  ALIAS_E.column_startdate AND ALIAS_E.column_enddate 
WHERE  ALIAS_E.fiscalyear >= 2016 

Macro used to load the data into the intermediate table

INSERT INTO INTERMEDIATE_DB.INTERMEDIATE_TABLE
SELECT * 
FROM   db_A.table_A ALIAS_A 
WHERE  Cast(ALIAS_A.columnD AS DATE) BETWEEN 
       CURRENT_DATE - 5 AND CURRENT_DATE

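So what I did here was use an intermediate table instead of a subquery. The intermediate table is loaded through the macro first, and then the select query is run. This now consumes only 50k impactCPU (for the macro and the select query combined).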

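My question: even though the logic behind both queries is the same (or so I believe), I cannot explain why this happens. If this is not the right approach, what is the best practice?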

1 Answer:

Answer 0: (score: 1)

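Your main problem is the Cast(ALIAS_A.columnD AS DATE). When you check the Explains, you will notice that the optimizer has no confidence for this step and probably greatly overestimates the number of rows returned.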
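To look at that estimate in isolation, you can explain just the filter on table_A. This is only a minimal sketch using the table and column names from the question. In the plan text, find the retrieve step on db_A.table_A and check the wording next to the estimated row count: "with no confidence" means the optimizer is guessing.

```sql
-- Explain only the date filter from the question, so the estimate
-- for this single retrieve step is easy to find in the plan text.
EXPLAIN
SELECT *
FROM   db_A.table_A ALIAS_A
WHERE  Cast(ALIAS_A.columnD AS DATE) BETWEEN
       CURRENT_DATE - 5 AND CURRENT_DATE;
```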

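But when you materialize the Select into a table, the optimizer knows the actual row count much better, and the order of the joins changes as a result.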

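When you collect statistics on Cast(ALIAS_A.columnD AS DATE), you might get the same plan without the intermediate table. Run DIAGNOSTIC HELPSTATS ON FOR SESSION; and the Explain should then show this as a recommended statistic.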
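A rough sketch of that sequence follows. It assumes a Teradata release that supports statistics on expressions (14.10 or later, as far as I know). The statistic name columnD_date is made up for illustration, and the exact COLLECT STATISTICS form should be taken from the recommendation that the Explain prints rather than from this sketch.

```sql
-- Have every Explain in this session list the statistics
-- the optimizer would like to have.
DIAGNOSTIC HELPSTATS ON FOR SESSION;

-- Explain the filter from the question; the recommended
-- COLLECT STATISTICS statements are appended to the plan text.
EXPLAIN
SELECT *
FROM   db_A.table_A ALIAS_A
WHERE  Cast(ALIAS_A.columnD AS DATE) BETWEEN
       CURRENT_DATE - 5 AND CURRENT_DATE;

-- Collect statistics on the cast expression itself.
-- columnD_date is a hypothetical name for the statistic.
COLLECT STATISTICS
        COLUMN (Cast(columnD AS DATE)) AS columnD_date
ON db_A.table_A;
```

After collecting, explain the original query again and compare the estimated row count and the join order with the intermediate-table version.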