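SQL Server complex aggregate filtering

Time: 2018-12-16 05:29:07

Tags: sql sql-server

I'm trying to optimize some queries that run against a large amount of data. I'll try to simplify the problem here. Let's start with a sample table: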

CREATE TABLE [dbo].[TestTable]
(
    [ProjectID] [INT] NOT NULL,
    [Index] [INT] NOT NULL,
    [Voltage] [DECIMAL](18, 3) NOT NULL,
    [Current] [DECIMAL](18, 3) NOT NULL
)

Imagine we have the following data:

ProjectID   Index   Voltage     Current
---------------------------------------
1           1       2.3         3.4 
1           2       2.5         3.3
1           3       2.7         3.0
1           4       2.8         2.9
1           5       2.5         3.1
1           6       2.0         3.4
1           7       1.2         3.5
1           8       0.5         3.0
2           1       2.0         1.0
2           2       5.0         2.0
2           3       3.0         2.0
2           4       1.0         1.0
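
For anyone who wants to reproduce the examples, the sample rows above can be loaded with a script along these lines. This is only a convenience sketch: it assumes the dbo.TestTable definition shown earlier and that the table is empty.

```sql
-- Load the twelve sample rows listed above into dbo.TestTable.
-- [Index] and [Current] are bracketed because they collide with T-SQL keywords.
INSERT INTO dbo.TestTable (ProjectID, [Index], Voltage, [Current])
VALUES
    (1, 1, 2.3, 3.4),
    (1, 2, 2.5, 3.3),
    (1, 3, 2.7, 3.0),
    (1, 4, 2.8, 2.9),
    (1, 5, 2.5, 3.1),
    (1, 6, 2.0, 3.4),
    (1, 7, 1.2, 3.5),
    (1, 8, 0.5, 3.0),
    (2, 1, 2.0, 1.0),
    (2, 2, 5.0, 2.0),
    (2, 3, 3.0, 2.0),
    (2, 4, 1.0, 1.0);
```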

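Essentially, my goal is to aggregate over the rows between a starting point and an ending point, ordered by the Index column. By starting and ending point I mean, for example, starting at the first row where Voltage >= 2.5 and continuing until the last row where Voltage >= 1.5.

Here is a sample query to illustrate: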

WITH CTE AS
(
    SELECT
        StartingTable.ProjectID,
        MIN(StartingTable.[Index]) StartingIndex,
        MIN(EndingTable.[Index]) - 1 EndingIndex
    FROM
        TestTable StartingTable
        JOIN TestTable EndingTable ON StartingTable.ProjectID = EndingTable.ProjectID
            AND EndingTable.[Index] > StartingTable.[Index]
    WHERE
        StartingTable.Voltage >= 2.5
        and EndingTable.Voltage <= 1.5
    GROUP BY
        StartingTable.ProjectID
)
SELECT
    TestTable.ProjectID,
    MAX(Voltage) MaxVoltage,
    StartingIndex,
    EndingIndex
FROM
    TestTable
    JOIN CTE ON TestTable.ProjectID = CTE.ProjectID
        AND TestTable.[Index] >= StartingIndex
        AND TestTable.[Index] <= EndingIndex
GROUP BY
    TestTable.ProjectID,
    StartingIndex,
    EndingIndex

For the sample data, it should return:

ProjectID MaxVoltage StartingIndex EndingIndex
1         2.800      2             6
2         5.000      2             3

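That works, but I really don't like joining TestTable to itself just to get the starting and ending indexes. We're dealing with a table that I think could eventually hold terabytes of data, so this looks like a poor choice to me. I just don't know what else to do.

I was thinking of some approach using window functions, but I'm not sure it's possible. It's almost as if I wanted to do this: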

MAX(Voltage) OVER (PARTITION BY ProjectID ORDER BY [Index] ROWS BETWEEN Voltage >= 2.5 AND Voltage >= 1.5)

I haven't seen anything suggesting that is possible. I also came up with the following:

WITH CTE AS
(
    SELECT
        ProjectID,
        [Index],
        MAX(Voltage) OVER (PARTITION BY ProjectId ORDER BY [Index] ROWS UNBOUNDED PRECEDING) MaxVoltage
    FROM
        TestTable
)
SELECT
    TestTable.ProjectID,
    MAX(Voltage) MaxVoltage,
    MIN(TestTable.[Index]) StartingIndex,
    MAX(TestTable.[Index]) EndingIndex
FROM
    TestTable
    JOIN CTE ON TestTable.ProjectID = CTE.ProjectID
        AND TestTable.[Index] = CTE.[Index]
WHERE
    MaxVoltage >= 2.5
    AND Voltage >= 1.5
GROUP BY
    TestTable.ProjectID

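I'm not sure this is much better. Are there better options than what I've already tried?

3 answers:

Answer 0 (score: 2)

If Voltage never goes above 2.5, then drops below 1.5, and then rises above 1.5 again, you can apply conditional aggregation: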

SELECT
   ProjectID,
   max(Voltage) as MaxVoltage,
   MIN(case when Voltage >= 2.5 then [index] end) AS StartingIndex,
   MAX(case when Voltage >= 1.5 then [index] end) AS EndingIndex
FROM TestTable
group by ProjectID
having MAX(Voltage) >= 2.5 -- to filter group which never reached 2.5

See the rextester fiddle.
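
One caveat worth illustrating: the query above aggregates over every row of a project, which is harmless for MAX(Voltage) but would skew other aggregates. A possible sketch that restricts the rows to the range first, under the same "no restart" assumption, is shown below. The Bounds CTE name and the AvgCurrent column are hypothetical and only there for illustration.

```sql
-- Sketch: compute the per-project bounds as windowed conditional aggregates,
-- then keep only the rows that fall inside [StartingIndex, EndingIndex].
WITH Bounds AS
(
    SELECT
        ProjectID,
        [Index],
        Voltage,
        [Current],
        MIN(CASE WHEN Voltage >= 2.5 THEN [Index] END)
            OVER (PARTITION BY ProjectID) AS StartingIndex,
        MAX(CASE WHEN Voltage >= 1.5 THEN [Index] END)
            OVER (PARTITION BY ProjectID) AS EndingIndex
    FROM dbo.TestTable
)
SELECT
    ProjectID,
    MAX(Voltage)   AS MaxVoltage,
    AVG([Current]) AS AvgCurrent,   -- hypothetical extra aggregate
    StartingIndex,
    EndingIndex
FROM Bounds
-- A NULL StartingIndex (project never reached 2.5) makes the predicate
-- unknown, so those projects drop out without a HAVING clause.
WHERE [Index] BETWEEN StartingIndex AND EndingIndex
GROUP BY ProjectID, StartingIndex, EndingIndex;
```

This still reads TestTable only once; whether it beats the self-join on a very large table would need to be checked against the actual execution plan.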

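Edit:

If your Voltage has repeated runs going from 2.5 down to 1.5, @Clockwork-Muse's query #2 works fine as long as there are no gaps in the [Index] column; otherwise it splits what should be one result row into two groups. If you want to ignore gaps, the following select returns the expected result: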

with cte as 
(
   SELECT
      ProjectID,
      [Index],
      Voltage,
      max(case when Voltage < 1.5 then [Index] end)
      over (partition by ProjectID
            order by [Index]
            rows unbounded preceding) AS grp -- same value for a range of rows >= 1.5
   FROM TestTable
 )
select
   ProjectID,
   max(Voltage) as MaxVoltage,
   MIN(case when Voltage >= 2.5 then [index] end) AS StartingIndex,
   MAX([index]) AS EndingIndex
from cte
where Voltage >=1.5
group by ProjectID, grp
having MAX(Voltage) >= 2.5 -- to filter group which never reached 2.5
order by ProjectID, grp
;

This groups consecutive rows with Voltage >= 1.5 and starts a new group whenever the voltage drops below 1.5; see Clockwork-Muse's modified db<>fiddle.
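
To see how the grp column behaves before the filter and aggregation are applied, the windowed expression can be run on its own. This is just an inspection sketch against the sample table:

```sql
-- grp is the [Index] of the most recent row (so far) with Voltage < 1.5,
-- or NULL if no such row has been seen yet in the project.
SELECT
    ProjectID,
    [Index],
    Voltage,
    MAX(CASE WHEN Voltage < 1.5 THEN [Index] END)
        OVER (PARTITION BY ProjectID
              ORDER BY [Index]
              ROWS UNBOUNDED PRECEDING) AS grp
FROM dbo.TestTable
ORDER BY ProjectID, [Index];
```

On the sample data, grp is NULL for indexes 1 to 6 of project 1 and indexes 1 to 3 of project 2, and equals the row's own [Index] for the rows below 1.5 (7 and 8 in project 1, 4 in project 2). Rows that follow a dip keep that dip's index as grp, which is what separates one run from the next.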

Answer 1 (score: 0)

SELECT tt.ProjectID, 
       MAX(tt.Voltage) AS MaxVoltage,
       x.StartIndex,
       MAX(tt.[Index]) AS EndIndex
FROM TestTable AS tt
JOIN 
(  
    SELECT ProjectID, 
           MIN([Index]) AS StartIndex
    FROM TestTable
    WHERE Voltage >= 2.5
    GROUP BY ProjectID 
) AS x ON tt.ProjectID = x.ProjectID 
WHERE tt.Voltage >= 1.5 
  AND tt.[Index] >= x.StartIndex
GROUP BY tt.ProjectID, x.StartIndex

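See the full test here: https://rextester.com/BCVL10968

Answer 2 (score: 0)

If, as in your sample dataset, the voltage only keeps decreasing once it has dropped to 1.5 volts (and the cycle never repeats), we can cheat by using conditional aggregation: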

SELECT [ProjectID], MAX([Voltage]) AS MaxVoltage, 
       MIN(CASE WHEN [Voltage] >= 2.5 THEN [Index] END) AS [StartingIndex],
       MAX(CASE WHEN [Voltage] >= 1.5 THEN [Index] END) AS [EndingIndex]
FROM [dbo].[TestTable]
WHERE [Voltage] >= 1.5
GROUP BY [ProjectId]
HAVING MAX([Voltage]) >= 2.5

Example Fiddle

This produces the requested output:

ProjectID | MaxVoltage | StartingIndex | EndingIndex
--------: | :--------- | ------------: | ----------:
        1 | 2.800      |             2 |           6
        2 | 5.000      |             2 |           3

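On the other hand, if we need to be wary of restarts, things get more complicated, and we need to turn this into a variant of a gaps-and-islands solution: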

SELECT [ProjectID], MAX([Voltage]) AS [MaxVoltage],
       MIN(CASE WHEN [Voltage] >= 2.5 THEN [Index] END) AS [StartingIndex],
       MAX(CASE WHEN [Voltage] >= 1.5 THEN [Index] END) AS [EndingIndex]
FROM (SELECT [ProjectId], [Index], [Voltage], 
             [Index] - ROW_NUMBER() OVER(PARTITION BY [ProjectID] ORDER BY [Index]) AS [VoltageRun]
      FROM [dbo].[TestTable]       
      WHERE [Voltage] >= 1.5) [TestTable]
GROUP BY [ProjectID], [VoltageRun]
HAVING MAX([Voltage]) >= 2.5
ORDER BY [ProjectID], [VoltageRun]

Example Fiddle

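This works because your table conveniently stores a (hopefully gapless) [Index] column. By selecting only the rows that are valid at all (>= 1.5), subtracting ROW_NUMBER() from [Index] gives us a "grouping column". Before aggregation, the result set looks like this: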

ProjectId | Index | Voltage | VoltageRun
--------: | ----: | :------ | :---------
        1 |     1 | 2.300   | 0         
        1 |     2 | 2.500   | 0         
        1 |     3 | 2.700   | 0         
        1 |     4 | 2.800   | 0         
        1 |     5 | 2.500   | 0         
        1 |     6 | 2.000   | 0         
        1 |     9 | 2.300   | 2         
        1 |    10 | 2.500   | 2         
        1 |    11 | 2.700   | 2         
        1 |    12 | 2.800   | 2         
        1 |    13 | 2.500   | 2         
        1 |    14 | 2.000   | 2         
        2 |     1 | 2.000   | 0         
        2 |     2 | 5.000   | 0         
        2 |     3 | 3.000   | 0 

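(The test data for [ProjectID] = 1 has been duplicated.)

After that, we just need to add the grouping column as an extra qualifier in the original query.
(Note that this type of query is one of the few cases where it makes sense to leave a grouping column out of the SELECT list.)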
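
If [Index] cannot be trusted to be gapless, one common variation replaces [Index] in the subtraction with a row number computed over all rows, so the grouping value no longer depends on the stored numbers. The following is a sketch of that idea, not part of the original answer; the Runs alias is hypothetical.

```sql
-- Sketch: difference of two row numbers. For rows with Voltage >= 1.5 the
-- difference equals the number of earlier rows below 1.5 in the project,
-- so it stays constant within a run and increases from one run to the next.
SELECT
    ProjectID,
    MAX(Voltage) AS MaxVoltage,
    MIN(CASE WHEN Voltage >= 2.5 THEN [Index] END) AS StartingIndex,
    MAX([Index]) AS EndingIndex
FROM (SELECT ProjectID, [Index], Voltage,
             ROW_NUMBER() OVER (PARTITION BY ProjectID
                                ORDER BY [Index])
           - ROW_NUMBER() OVER (PARTITION BY ProjectID,
                                             CASE WHEN Voltage >= 1.5 THEN 1 ELSE 0 END
                                ORDER BY [Index]) AS VoltageRun
      FROM dbo.TestTable) AS Runs
WHERE Voltage >= 1.5            -- filter after numbering, not before
GROUP BY ProjectID, VoltageRun
HAVING MAX(Voltage) >= 2.5      -- drop runs that never reached 2.5
ORDER BY ProjectID, VoltageRun;
```

On the original sample data this returns the same two rows as the expected result in the question. The trade-off is a second window sort over an extra partitioning expression, so it is worth comparing plans before using it on a very large table.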