在雪花中使用嵌套窗口函数

时间:2021-04-08 21:34:06

标签: sql nested snowflake-cloud-data-platform window-functions

我已经看到很多关于这个一般错误的问题,但我不明白为什么会出现这个错误,可能是因为嵌套的窗口函数......


通过下面的查询,我得到了 Col_CCol_D、...以及我尝试过的几乎所有内容的错误

<块引用>

SQL 编译错误:[eachColumn] 不是一个有效的 group by 表达式

SELECT
    Col_A,
    Col_B,
    FIRST_VALUE(Col_C) IGNORE NULLS OVER (PARTITION BY Col_A, Col_B
                                    ORDER BY Col_TimeStamp ASC 
                                    ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW),
    MAX(Col_D)                      OVER (PARTITION BY Col_A, Col_B
                                    ORDER BY Col_TimeStamp ASC
                                    ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW),
    FIRST_VALUE(CASE WHEN Col_T = 'testvalue'
                THEN LAST_VALUE(Col_E) IGNORE NULLS OVER (PARTITION BY Col_A, Col_B
                                                    ORDER BY Col_TimeStamp DESC 
                                                    ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
                ELSE NULL END) IGNORE NULLS 
                                    OVER (PARTITION BY Col_A, Col_B
                                    ORDER BY Col_TimeStamp ASC
                                    ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
FROM mytable

那么,是否有一种在 Snowflake 中使用嵌套窗口函数的方法(使用 case when ...)?如果是这样,我怎么/我做错了什么?

2 个答案:

答案 0 :(得分:2)

因此解构您的逻辑以表明它是导致问题的第二个 FIRST_VALUE

WITH data(Col_A,Col_B,Col_c,col_d, Col_TimeStamp, col_t,col_e) AS (
    SELECT * FROM VALUES
        (1,1,1,1,1,'testvalue',10),
        (1,1,2,3,2,'value',11)
)
SELECT
    Col_A,
    Col_B,
    FIRST_VALUE(Col_C) IGNORE NULLS OVER (PARTITION BY Col_A, Col_B 
                                    ORDER BY Col_TimeStamp ASC 
                                    ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) as first_c,
    MAX(Col_D)                      OVER (PARTITION BY Col_A, Col_B
                                    ORDER BY Col_TimeStamp ASC
                                    ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW),
    LAST_VALUE(Col_E) IGNORE NULLS OVER (PARTITION BY Col_A, Col_B
                                    ORDER BY Col_TimeStamp DESC 
                                    ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) as last_e,   
    IFF(Col_T = 'testvalue', last_e, NULL) as if_test_last_e
    /*,FIRST_VALUE(if_test_last_e) IGNORE NULLS OVER (PARTITION BY Col_A, Col_B 
                                    ORDER BY Col_TimeStamp ASC 
                                    ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) as the_problem*/
FROM data
ORDER BY Col_A,Col_B, col_timestamp
;

如果我们取消对 the_problem 的注释,我们就拥有了.. 与 PostgreSQL(我的背景)相比,仅仅重用这么多先前的结果/步骤是一种礼物,所以在这里我只是破坏了另一个 SELECT 层。

WITH data(Col_A,Col_B,Col_c,col_d, Col_TimeStamp, col_t,col_e) AS (
    SELECT * FROM VALUES
        (1,1,1,1,1,'testvalue',10),
        (1,1,2,3,2,'value',11)
)
SELECT *,
    FIRST_VALUE(if_test_last_e) IGNORE NULLS OVER (PARTITION BY Col_A, Col_B 
                                    ORDER BY Col_TimeStamp ASC 
                                    ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) as not_a_problem
FROM (
    SELECT
        Col_A,
        Col_B,
        FIRST_VALUE(Col_C) IGNORE NULLS OVER (PARTITION BY Col_A, Col_B 
                                        ORDER BY Col_TimeStamp ASC 
                                        ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) as first_c,
        MAX(Col_D)                      OVER (PARTITION BY Col_A, Col_B
                                        ORDER BY Col_TimeStamp ASC
                                        ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW),
        LAST_VALUE(Col_E) IGNORE NULLS OVER (PARTITION BY Col_A, Col_B
                                        ORDER BY Col_TimeStamp DESC 
                                        ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) as last_e,   
        IFF(Col_T = 'testvalue', last_e, NULL) as if_test_last_e
        ,Col_TimeStamp
    FROM data
)
ORDER BY Col_A,Col_B, Col_TimeStamp

然后一切正常。如果您 LAG 然后 IFF/FIRST_VALUE 然后 LAG 第二个结果,也会发生这种情况。

答案 1 :(得分:1)

<块引用>

“我已经看到很多关于这个一般错误的问题,但我不明白为什么会出现这个错误,可能是因为嵌套的窗口函数......”

Snowflake 支持在同一级别重用表达式(有时称为 "lateral column alias reference"

写起来完全没问题:

if(snapshot.data().containsKey("lastAccess")){

}
else{

}

在标准 SQL 中,您将不得不使用多级查询 (cte) 或使用 LATERAL JOIN。相关:PostgreSQL: using a calculated column in the same query


不幸的是,相同的语法不适用于分析函数(我现在知道任何支持它的 RDMBS):

SELECT 1+1 AS col1,
       col1 *2 AS col2,
       CASE WHEN col1 > col2 THEN 'Y' ELSE 'NO' AS col3
       ...

在 SQL Standard 2016 中有一个可选特性:T619 嵌套窗口函数

这里有一篇文章介绍了嵌套分析函数查询的样子:Nested window functions in SQL

这意味着当前嵌套窗口函数的方法是使用派生表/cte:

SELECT ROW_NUMBER() OVER(PARTITION BY ... ORDER BY ...) AS rn
      ,MAX(rn) OVER(PARTITION BY <different than prev) AS m
FROM tab;