What is the best way to summarise interval data?

Asked: 2014-06-23 14:38:26

Tags: sql-server sql-server-2008-r2

Given data in a table with arbitrary intervals (not date/time!) defined as follows:

START float
END float
VALUE varchar(40)

E.g.

 START    END    VALUE
 -----    ---    ------
 0        1      Banana
 1        3      Banana
 3        4      Orange
 4        7      Orange
 7        8      Apple
 8        9      Apple
 9       10      Apple
10       15      Apple
20       22      Apple
22       23      Apple
23       28      Banana
28       30      Banana
etc.

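How can I summarise the data so that, for contiguous intervals, only one row per value is listed? I.e. the result of the query should look like this: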

START     END    VALUE
-----     ---    ------
 0        3      Banana
 3        7      Orange
 7       15      Apple
20       23      Apple
23       30      Banana

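Note the gap between 15 and 20 above. I am working with a large amount of data (~500k rows), but the query is not run often, so efficiency is desirable but not critical. Can this be done without using cursors?

(Note: I am using SQL Server 2008 R2, so I cannot take advantage of newer features, if any exist.)

Thanks!

3 Answers:

Answer 0 (score: 3)

This should work for you: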

DECLARE @T TABLE (Start INT, [End] INT, Value VARCHAR(100));
INSERT @T (Start, [End], Value)
VALUES
    (0, 1, 'Banana'), (1, 3, 'Banana'), (3, 4, 'Orange'), (4, 7, 'Orange'),
    (7, 8, 'Apple'), (8, 9, 'Apple'), (9, 10, 'Apple'), (10, 15, 'Apple'), 
    (20, 22, 'Apple'), (22, 23, 'Apple'), (23, 28, 'Banana'), (28, 30, 'Banana');

WITH CTE AS
(   SELECT  t.[Start], 
            t.[End], 
            t.[value], 
            IsStart = ISNULL(c.IsStart, 1)
    FROM    @T AS T
            OUTER APPLY
            (   SELECT  TOP 1 IsStart = 0
                FROM    @T AS T2
                WHERE   T2.Value = T.Value
                AND     T2.[End] = T.Start
            ) AS c
)
SELECT  Value, Start = MIN(Start), [End] = MAX([End])
FROM    CTE AS T
        OUTER APPLY
        (   SELECT  SUM(IsStart)
            FROM    CTE AS T2
            WHERE   T2.Value = T.Value
            AND     T2.Start <= T.Start
        ) g (GroupingSet)
GROUP BY Value, GroupingSet
ORDER BY Start;

The first step is to identify each record that starts a new range. This part:

SELECT  t.[Start], 
        t.[End], 
        t.[value], 
        IsStart = ISNULL(c.IsStart, 1)
FROM    @T AS T
        OUTER APPLY
        (   SELECT  TOP 1 IsStart = 0
            FROM    @T AS T2
            WHERE   T2.Value = T.Value
            AND     T2.[End] = T.Start
        ) AS c

will give:

Start   End value   IsStart
0       1   Banana  1
1       3   Banana  0
3       4   Orange  1
4       7   Orange  0
7       8   Apple   1
8       9   Apple   0
9       10  Apple   0
10      15  Apple   0
20      22  Apple   1
22      23  Apple   0
23      28  Banana  1
28      30  Banana  0

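You can then create unique groups by counting the ranges that start at or before the current record, which in effect adds a running total of the IsStart column partitioned by Value. This is done here: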

SELECT  *
FROM    CTE AS T
        OUTER APPLY
        (   SELECT  SUM(IsStart)
            FROM    CTE AS T2
            WHERE   T2.Value = T.Value
            AND     T2.Start <= T.Start
        ) g (GroupingSet);

which gives:

Start   End value   IsStart GroupingSet
0       1   Banana  1       1
1       3   Banana  0       1
3       4   Orange  1       1
4       7   Orange  0       1
7       8   Apple   1       1
8       9   Apple   0       1
9       10  Apple   0       1
10      15  Apple   0       1
20      22  Apple   1       2   -- SECOND NON CONTINUOUS RANGE FOR APPLES
22      23  Apple   0       2
23      28  Banana  1       2   -- SECOND NON CONTINUOUS RANGE FOR BANANAS
28      30  Banana  0       2

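Finally, you can aggregate, grouping by Value and by this identifier column, which distinguishes the separate ranges of the same value.

You could also do this by joining to a numbers table to expand each range into one row per unit (I have used master..spt_values for brevity):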
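For a table of the size mentioned in the question, one possible variation (a sketch, not part of the answer as posted) is to store the flagged rows in an indexed temp table, because a CTE that is referenced twice may be expanded twice in the plan. The names `#Intervals`, `#Flagged` and `IX_Flagged` are assumptions for illustration, and the columns are float as in the question rather than INT as in the sample above.

```sql
-- Hypothetical stand-in for the question's table (names are assumptions).
CREATE TABLE #Intervals ([Start] FLOAT, [End] FLOAT, Value VARCHAR(40));
INSERT #Intervals ([Start], [End], Value)
VALUES (0, 1, 'Banana'), (1, 3, 'Banana'), (3, 4, 'Orange'), (4, 7, 'Orange'),
       (7, 8, 'Apple'), (10, 15, 'Apple'), (20, 22, 'Apple');

-- Step 1: flag the rows that start a new range and materialise the result.
-- A row starts a range when no row with the same Value ends exactly where it starts.
SELECT  T.[Start], T.[End], T.Value,
        IsStart = CASE WHEN EXISTS
                  (   SELECT  1
                      FROM    #Intervals AS T2
                      WHERE   T2.Value = T.Value
                      AND     T2.[End] = T.[Start]   -- exact float match, as in the original query
                  ) THEN 0 ELSE 1 END
INTO    #Flagged
FROM    #Intervals AS T;

-- Index that supports the per-Value running total below.
CREATE CLUSTERED INDEX IX_Flagged ON #Flagged (Value, [Start]);

-- Steps 2 and 3: running total of IsStart per Value, then aggregate.
SELECT  F.Value, [Start] = MIN(F.[Start]), [End] = MAX(F.[End])
FROM    #Flagged AS F
        CROSS APPLY
        (   SELECT  SUM(F2.IsStart)
            FROM    #Flagged AS F2
            WHERE   F2.Value = F.Value
            AND     F2.[Start] <= F.[Start]
        ) AS g (GroupingSet)
GROUP BY F.Value, g.GroupingSet
ORDER BY MIN(F.[Start]);

DROP TABLE #Flagged;
DROP TABLE #Intervals;
```

The running total is still a triangular join within each Value, so this sketch only removes the repeated start lookup; how much it helps will depend on how many rows share a value.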

WITH CTE AS
(   SELECT  t.[value], 
            Number = t.Start + v.Number,
            GroupingSet = t.Start + v.Number - ROW_NUMBER() OVER(PARTITION BY t.[value] ORDER BY t.Start + v.Number)
    FROM    @T AS T
            INNER JOIN Master..spt_values v
                ON v.[Type] = 'P'
                AND v.Number < (t.[End] - t.[Start])
)
SELECT  Value, [Start] = MIN(Number), [End] = MAX(Number) + 1 -- the last expanded unit starts at End - 1
FROM    CTE
GROUP BY GroupingSet, Value;

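The downside is that it is memory-intensive if you have many rows or large ranges. Once the ranges are expanded, this simply uses the ranking-function approach described in Itzik Ben-Gan's Gaps and Islands Solutions.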
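One caveat with master..spt_values is that its type 'P' rows only cover the numbers 0 to 2047, so any range wider than 2048 units would be cut short without an error. A minimal sketch of a larger numbers table follows; the names `#Numbers` and `IX_Numbers` and the 100,000-row size are assumptions chosen for illustration.

```sql
-- Build a hypothetical numbers table holding 0 .. 99,999 by cross joining small row sets.
WITH E1 (N) AS
(   SELECT  1
    FROM    (VALUES (1), (1), (1), (1), (1), (1), (1), (1), (1), (1)) AS v (N)   -- 10 rows
),
E2 (N) AS (SELECT 1 FROM E1 AS a CROSS JOIN E1 AS b),                    -- 100 rows
E5 (N) AS (SELECT 1 FROM E2 AS a CROSS JOIN E2 AS b CROSS JOIN E1 AS c)  -- 100,000 rows
SELECT  Number = ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) - 1          -- zero-based, like spt_values
INTO    #Numbers
FROM    E5;

-- A unique index lets the range predicate seek instead of scanning.
CREATE UNIQUE CLUSTERED INDEX IX_Numbers ON #Numbers (Number);
```

With this in place, `#Numbers` could stand in for `Master..spt_values` in the join above, dropping the `v.[Type] = 'P'` predicate. Note that the expansion approach itself assumes whole-number boundaries, so it would not suit the float columns in the question unless the values are always integral.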

Answer 1 (score: 1)

With SQL Server 2008, one way is to use a triangular join, with a slight twist:

WITH I AS (
  SELECT ID = Row_Number() OVER (ORDER BY Start)
       , _Start = [Start]
       , _End = [End]
       , Value
  FROM   Data
), D AS (
  SELECT i.ID, i._Start, i._End, i.Value
       , m.id _id, m.value _value
       , R = CASE WHEN i.Value <> m.Value THEN 1 
                  WHEN m._End <> i._Start THEN 1 
                  ELSE 0 
             END
  FROM   I
         CROSS APPLY (SELECT TOP 1
                             id, _Start, _End, value
                      FROM   I m
                      WHERE  m.ID IN (i.ID, i.ID - 1)
                      ORDER BY ID) m
), B AS (
  SELECT i.ID, i._Start, i._End, i.Value
       , R = SUM(l.R)
  FROM   D i
         LEFT  JOIN D l ON i.id >= l.id
  GROUP BY i.ID, i._Start, i._End, i.Value
)
SELECT [START] = MIN(_Start)
     , [END] = MAX(_End)
     , Value
FROM   B
GROUP BY R, Value
ORDER BY 1

SQLFiddle Demo

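CTE I (ID) assigns a sequential ID to every row, ordered by Start; the ID is used to pick the correct rows in the later joins.

CTE D (Data) uses CROSS APPLY to get the previous row (or the same row, for the first row), which stands in for LAG, a function that is not available in SQL Server 2008 R2. It checks the previous row to see whether Value has changed, or whether there is a gap between the current [START] and the previous [END].

CTE B (Block) uses a triangular JOIN of D with itself to create a field that stores the number of changes and gaps from the beginning up to the current row; this field has the same number for every row in the same block of data.

The main query uses that new column to aggregate the data.

Answer 2 (score: 1)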

WITH TableWithPreviousAndNext AS (
    SELECT CA1.[Previous]
          ,Table1.[Start]
          ,Table1.[End]
          ,CA2.[Next]
          ,Table1.[Value]
          ,(1 + ROW_NUMBER() OVER (PARTITION BY [Value] ORDER BY Table1.[Start])) / 2 AS [Group]
    FROM Table1
         CROSS APPLY (
             SELECT MAX([End]) AS [Previous]
             FROM Table1 AS InnerTable1
             WHERE InnerTable1.[Value] = Table1.[Value]
                   AND InnerTable1.[Start] < Table1.[Start]
         ) AS CA1
         CROSS APPLY (
             SELECT MIN([Start]) AS Next
             FROM Table1 AS InnerTable1
             WHERE InnerTable1.[Value] = Table1.[Value]
                   AND InnerTable1.[Start] > Table1.[Start]
         ) AS CA2
        CROSS APPLY ( -- A little trick to create a 2 row group for isolated rows
            SELECT 1 AS Dummy
          UNION ALL
            SELECT 1
            WHERE ([Previous] IS NULL OR [Previous] <> [Start])
                  AND ([Next] IS NULL OR [Next] <> [End])
        ) AS CA3
    WHERE [Previous] IS NULL -- Remove all but first and last in sequence
          OR [Next] IS NULL
          OR [Previous] <> [Start]
          OR [End] <> [Next]
)
SELECT MIN([Start])
      ,MAX([End])
      ,[Value]
FROM TableWithPreviousAndNext
GROUP BY [Value]
        ,[Group]
ORDER BY MIN(Start)