Determining the boundaries of N groups

Time: 2014-01-09 15:34:15

Tags: sql sql-server sql-server-2008 gaps-and-islands

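I have spent a lot of time on the following problem:

Imagine you have N groups, each containing multiple records, and each record has its own starting and ending point.

In other words: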

ID | GroupName | StartingPoint | EndingPoint | seq (row_number) | desired_seq
___|___________|_______________|_____________|__________________|____________
1  | Grp1      | 2014-01-06    | 2014-01-07  | 1                | 1
2  | Grp1      | 2014-01-07    | 2014-01-08  | 2                | 2
3  | Grp2      | 2014-01-08    | 2014-01-09  | 1                | 1
4  | Grp1      | 2014-01-09    | 2014-01-10  | 3                | 1
5  | Grp2      | 2014-01-10    | 2014-01-11  | 2                | 1

As you can see, the starting point of each record is the same as the ending point of the previous record.
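For anyone who wants to reproduce the sample data, a setup script along these lines could be used. The table name table1 matches the query further down; the column types are assumptions, since they are not stated here.

```sql
-- Assumed schema: the column types are guesses, not taken from the real table.
CREATE TABLE table1 (
    ID            INT         NOT NULL PRIMARY KEY,
    GroupName     VARCHAR(10) NOT NULL,
    StartingPoint DATE        NOT NULL,
    EndingPoint   DATE        NOT NULL
);

-- The five sample rows shown in the table above.
INSERT INTO table1 (ID, GroupName, StartingPoint, EndingPoint) VALUES
    (1, 'Grp1', '2014-01-06', '2014-01-07'),
    (2, 'Grp1', '2014-01-07', '2014-01-08'),
    (3, 'Grp2', '2014-01-08', '2014-01-09'),
    (4, 'Grp1', '2014-01-09', '2014-01-10'),
    (5, 'Grp2', '2014-01-10', '2014-01-11');
```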

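Basically, I want to get the minimum and maximum dates for each group, in date order. As soon as a record with a different group name appears, it should be treated as the start of a new group and the sequence should reset.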

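The row_number() function partitioned by group name is not enough for this task, because it keeps counting when a group name reappears instead of restarting. (The seq column in the sample data shows the values that row_number() generates.)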

Desired result based on the sample data:

1 | Grp1 | 2014-01-06 | 2014-01-08
2 | Grp2 | 2014-01-08 | 2014-01-09
3 | Grp1 | 2014-01-09 | 2014-01-10
4 | Grp2 | 2014-01-10 | 2014-01-11

What I have tried:

;with cte as(
select *
, row_number() over (partition by GroupName order by startingpoint) as seq
from table1
)
select * 
into #temp2
from cte t1
left join cte t2 on t1.id=t2.id and t1.seq= t2.seq-1

select * 
,(select startingPoint from #temp2 t2 where t1.id=t2.id and t2.seq= (select MIN(seq) from #temp2)) as Oldest
,(select startingPoint from #temp2 t2 where t1.id=t2.id and t2.seq= (select MAX(seq) from #temp2)) as MostRecent
from #temp2 t1

3 answers:

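Answer 0: (score: 3)

This is a gaps-and-islands problem with subgroups. The trick is to group by the difference between two ROW_NUMBER() values, one partitioned by GroupName and one unpartitioned. Within a run of consecutive rows from the same group both counters advance by one per row, so their difference stays constant; it changes as soon as rows from another group come in between.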

WITH t AS (
  SELECT
    GroupName,
    StartingPoint,
    EndingPoint,
    ROW_NUMBER() OVER(PARTITION BY GroupName ORDER BY StartingPoint)
      - ROW_NUMBER() OVER(ORDER BY StartingPoint) AS SubGroupId
  FROM #test
)
SELECT
  ROW_NUMBER() OVER (ORDER BY MIN(StartingPoint)) AS SortOrderId,
  GroupName                                       AS GroupName,
  MIN(StartingPoint)                              AS GroupStartingPoint,
  MAX(EndingPoint)                                AS GroupEndingPoint
FROM t
GROUP BY GroupName, SubGroupId
ORDER BY SortOrderId
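
To see why the difference works as a subgroup key, it can help to look at both row numbers side by side. The following is only an illustrative sketch: it inlines the five sample rows from the question, and the names sample_rows, RnInGroup and RnOverall are made up for this example.

```sql
-- Sample rows from the question, inlined so the query runs on its own.
WITH sample_rows (GroupName, StartingPoint, EndingPoint) AS (
    SELECT GroupName, CAST(StartingPoint AS DATE), CAST(EndingPoint AS DATE)
    FROM (VALUES
        ('Grp1', '2014-01-06', '2014-01-07'),
        ('Grp1', '2014-01-07', '2014-01-08'),
        ('Grp2', '2014-01-08', '2014-01-09'),
        ('Grp1', '2014-01-09', '2014-01-10'),
        ('Grp2', '2014-01-10', '2014-01-11')
    ) AS v (GroupName, StartingPoint, EndingPoint)
)
SELECT
    GroupName,
    StartingPoint,
    -- Counter that restarts for each group name.
    ROW_NUMBER() OVER (PARTITION BY GroupName ORDER BY StartingPoint) AS RnInGroup,
    -- Counter over all rows, regardless of group.
    ROW_NUMBER() OVER (ORDER BY StartingPoint)                        AS RnOverall,
    -- Constant within each run of consecutive rows of the same group.
    ROW_NUMBER() OVER (PARTITION BY GroupName ORDER BY StartingPoint)
      - ROW_NUMBER() OVER (ORDER BY StartingPoint)                    AS SubGroupId
FROM sample_rows
ORDER BY StartingPoint;
```

On the sample data this gives SubGroupId values of 0, 0, -2, -1 and -3 in date order, so grouping by GroupName and SubGroupId yields four subgroups, with only the first two Grp1 rows merged.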

Answer 1: (score: 0)

Not sure, but maybe:

SELECT DISTINCT 
    GroupName, 
    MIN(StartingPoint) OVER (PARTITION BY GroupName ORDER BY Id), 
    MAX(EndingPoint) OVER (PARTITION BY GroupName ORDER BY Id)
FROM table1

Because partitioning does not reduce the number of rows, the resulting duplicate entries are removed by DISTINCT.

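Answer 2: (score: 0)

This is easier with the lag() function in SQL Server 2012. The way I approach these problems is to find where each group starts, assigning a flag of 1 or 0 to each row. Then take the cumulative sum of the 1s to get a new group ID.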
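
The lag() version is not spelled out here. A minimal sketch for SQL Server 2012 or later might look like the following. It assumes table1 from the question, and it assumes that a new group starts whenever the previous row by StartingPoint has a different GroupName; the CTE names flagged and numbered are made up for this example.

```sql
WITH flagged AS (
    SELECT
        GroupName,
        StartingPoint,
        EndingPoint,
        -- 1 when the previous row (by date) belongs to another group,
        -- or when there is no previous row (LAG returns NULL); otherwise 0.
        CASE WHEN LAG(GroupName) OVER (ORDER BY StartingPoint) = GroupName
             THEN 0 ELSE 1
        END AS GroupStartFlag
    FROM table1
),
numbered AS (
    SELECT
        GroupName,
        StartingPoint,
        EndingPoint,
        -- Running total of the start flags gives each run its own number.
        SUM(GroupStartFlag) OVER (ORDER BY StartingPoint
                                  ROWS UNBOUNDED PRECEDING) AS GroupNum
    FROM flagged
)
SELECT
    GroupNum,
    GroupName,
    MIN(StartingPoint) AS StartingPoint,
    MAX(EndingPoint)   AS EndingPoint
FROM numbered
GROUP BY GroupNum, GroupName
ORDER BY GroupNum;
```

On the five sample rows the flags come out as 1, 0, 1, 1, 1, so GroupNum runs from 1 to 4 and the output should match the desired result in the question.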

In SQL Server 2008, you can do this with correlated subqueries (or joins):

with table1_flag as (
      select t1.*,
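             -- 1 when no record of the same group ends where this one starts,
             -- i.e. this record begins a new group; otherwise 0
             (case when exists (select 1
                                from table1 t2
                                where t2.groupname = t1.groupname and
                                      t2.endingpoint = t1.startingpoint
                               )
                   then 0 else 1
              end) as groupstartflag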
      from table1 t1
     ),
     table1_flag_cum as (
      select tf.*,
             (select sum(groupstartflag)
              from table1_flag tf2
              where tf2.groupname = tf.groupname and
                    tf2.startingpoint <= tf.startingpoint
             ) as groupnum
      from table1_flag tf
     )
select groupnum, groupname,
       min(startingpoint) as startingpoint, max(endingpoint) as endingpoint
from table1_flag_cum
group by groupnum, groupname;