Comparing two sets of date ranges in SQL

Time: 2011-01-04 23:26:40

Tags: sql db2 sas

I have two sets of data with different date ranges.

Tbl 1:  
ID, Date_Start, Date_End
1, 2010-01-01, 2010-01-09
1, 2010-01-10, 2010-01-19
1, 2010-01-30, 2010-01-31

Tbl 2:
ID, Date_Start, Date_End
1, 2010-01-01, 2010-01-04
1, 2010-01-08, 2010-01-17
1, 2010-01-30, 2010-01-31

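I want to find the cases where a date range in Tbl 1 is not completely covered by the date ranges in Tbl 2. For this example, I would want the output to look like this: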

Output:
ID, Gap_Start, Gap_End
1, 2010-01-05, 2010-01-07
1, 2010-01-18, 2010-01-19

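Date ranges never overlap within a table. I am using DB2 SQL or SAS for this. Unfortunately, the datasets are large enough (millions of records) that I can't brute-force it.

Thanks!

4 answers:

Answer 0: (score: 1)

I don't think there is a solution that is both efficient and general for all cases. In some situations, however, we can work out something efficient. For example, the code below assumes: (1) datasets 1 and 2 have the same id groups in the same order; (2) the span of possible dates is relatively short (here all dates are assumed to fall in 2010). Note that one input range may produce two gaps.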

/* test data */
data one;
  input id1 (start1 finish1) (:anydtdte.);
  format start1 finish1 e8601da.;
cards;
1 2010-01-01 2010-01-09
1 2010-01-10 2010-01-19
1 2010-01-30 2010-01-31
2 2010-01-02 2010-01-10
;
run;

data two;
  input id2 (start2 finish2) (:anydtdte.);
  format start2 finish2 e8601da.;
cards;
1 2010-01-01 2010-01-04
1 2010-01-08 2010-01-17
1 2010-01-30 2010-01-31
2 2010-01-05 2010-01-06
;
run;


/* assumptions:
   (1) datasets one and two have the same set of ids in the same
       sorted order;
   (2) only possible dates are in the year of 2010
*/
%let minDate = %sysevalf('01jan2010'd - 1);
%let maxDate = %sysevalf('31dec2010'd + 1);

data gaps;

  array inRange[&minDate:&maxDate] _temporary_;
  array covered[&minDate:&maxDate] _temporary_;
  do i = &minDate to &maxDate; inRange[i] = 0; covered[i] = 0; end;

  do until (last.id1);
    set one;
    by id1;
    do i = start1 to finish1; inRange[i] = 1; end;
  end;

  do until (last.id2);
    set two;
    by id2;
    do i =  start2 to finish2; covered[i] = 1; end;
  end;

  format startGap finishGap e8601da.;
  startGap = .;
  finishGap = .;
  do i = &minDate+1 to &maxDate;
    if inRange[i] and not covered[i] and missing(startGap) then startGap = i;
    if (covered[i] or not inRange[i]) and not missing(startGap) and not covered[i-1] then do;
      finishGap = i - 1;
      output;
      call missing(startGap, finishGap);
      keep id1 startGap finishGap;
    end;
  end;     
run;

/* check */
proc print data=gaps noobs;
run; 
/* on lst 
id1     startGap     finishGap

 1     2010-01-05    2010-01-07
 1     2010-01-18    2010-01-19
 2     2010-01-02    2010-01-04
 2     2010-01-07    2010-01-10
*/
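The day-flag idea used in the SAS step above can also be sketched in Python. This is only an illustration for a single id, assuming inclusive (start, end) pairs of `datetime.date` values; the helper names `days` and `uncovered_days` are hypothetical and not part of the answer. Like the SAS version, it expands every range into individual days, so it only suits short date spans.

```python
from datetime import date, timedelta

def days(start, end):
    """Yield every date from start to end, inclusive."""
    for offset in range((end - start).days + 1):
        yield start + timedelta(days=offset)

def uncovered_days(ranges1, ranges2):
    """Return the sorted days that fall in ranges1 but in no range of ranges2."""
    in_range = {d for s, e in ranges1 for d in days(s, e)}
    covered = {d for s, e in ranges2 for d in days(s, e)}
    return sorted(in_range - covered)

# Data for id 1 from the question.
one = [(date(2010, 1, 1), date(2010, 1, 9)),
       (date(2010, 1, 10), date(2010, 1, 19)),
       (date(2010, 1, 30), date(2010, 1, 31))]
two = [(date(2010, 1, 1), date(2010, 1, 4)),
       (date(2010, 1, 8), date(2010, 1, 17)),
       (date(2010, 1, 30), date(2010, 1, 31))]

print(uncovered_days(one, two))
```

The result is a list of days, which still has to be grouped into consecutive runs to get gap ranges.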

Answer 1: (score: 1)

This is not a complete solution, since it returns a list of dates rather than ranges, but it may be useful:

SELECT
  R1.ID, D.Date
FROM
  #Ranges1 AS R1
  INNER JOIN Dates AS D ON D.Date BETWEEN R1.StartDate AND R1.EndDate
EXCEPT
SELECT
  R2.ID, D.Date
FROM
  #Ranges2 AS R2
  INNER JOIN Dates AS D ON D.Date BETWEEN R2.StartDate AND R2.EndDate

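Note that this solution requires a dates table: a table with one record per day, covering every date you might use. It has the advantages of being concise and of handling overlapping date ranges (not necessary in your case, but perhaps for the next person).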
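If the per-day rows are pulled into a client program, they can be collapsed into ranges in one pass. A minimal Python sketch, assuming the rows arrive as (id, date) tuples; `collapse_dates` is a hypothetical helper, not part of the query above.

```python
from datetime import date, timedelta

def collapse_dates(rows):
    """Collapse (id, date) rows into (id, start, end) runs of consecutive days."""
    ranges = []
    for key, day in sorted(set(rows)):
        last = ranges[-1] if ranges else None
        if last and last[0] == key and last[2] + timedelta(days=1) == day:
            # Same id and the day directly follows the current run: extend it.
            ranges[-1] = (key, last[1], day)
        else:
            # New id or a break in the sequence: start a new run.
            ranges.append((key, day, day))
    return ranges

rows = [(1, date(2010, 1, d)) for d in (5, 6, 7, 18, 19)]
print(collapse_dates(rows))
```

Sorting first means the rows do not need to arrive in any particular order.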

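Answer 2: (score: 1)

Following on from Jon of All Trades' approach, here is a more complete solution. The key features are:

  1. Use an auxiliary calendar table, which is simply a list of all dates.
  2. From the calendar table, JOIN to Tbl1 to get the list of dates that fall within its ranges.
  3. At the same time, anti-JOIN to Tbl2 to keep only the dates that are not within a Tbl2 range.
  4. I have wrapped these results in a Common Table Expression (CTE) called OutDates.
  5. Define another CTE based on OutDates to get the dates that start a gap; call this EarliestDates.
  6. Define another CTE based on OutDates to get the dates that end a gap; call this LatestDates.
  7. Join EarliestDates and LatestDates to put each gap on a single row.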

    WITH
    OutDates(ID, dt) AS
    ( SELECT Tbl1.ID, Calendar.dt FROM Calendar
    INNER JOIN Tbl1 ON Calendar.dt BETWEEN Tbl1.Date_Start AND Tbl1.Date_End
    -- match on ID so that a date is only treated as covered by the same ID's ranges
    LEFT OUTER JOIN Tbl2 ON Tbl2.ID = Tbl1.ID
     AND Calendar.dt BETWEEN Tbl2.Date_Start AND Tbl2.Date_End
    WHERE Tbl2.ID IS NULL
    )
    ,
    EarliestDates AS
    (   SELECT earliest.ID, earliest.dt FROM OutDates earliest
        LEFT OUTER JOIN OutDates nonesuch_earlier
         ON nonesuch_earlier.ID = earliest.ID
         AND DATEADD(day, -1, earliest.dt) = nonesuch_earlier.dt
        WHERE nonesuch_earlier.ID IS NULL
    )
    ,
    LatestDates AS
    (   SELECT latest.ID, latest.dt FROM OutDates latest
        LEFT OUTER JOIN OutDates nonesuch_later
         ON nonesuch_later.ID = latest.ID
         AND DATEADD(day, 1, latest.dt) = nonesuch_later.dt
        WHERE nonesuch_later.ID IS NULL
    )
    SELECT rangestart.ID, rangestart.dt AS Gap_Start, rangeend.dt AS Gap_End
     FROM EarliestDates rangestart JOIN LatestDates rangeend
     ON rangestart.ID = rangeend.ID AND rangestart.dt <= rangeend.dt
    LEFT OUTER JOIN EarliestDates nonesuch_inner1
     ON nonesuch_inner1.ID = rangestart.ID
     AND nonesuch_inner1.dt <= rangeend.dt AND nonesuch_inner1.dt > rangestart.dt
    LEFT OUTER JOIN LatestDates nonesuch_inner2
     ON nonesuch_inner2.ID = rangestart.ID
     AND nonesuch_inner2.dt >= rangestart.dt AND nonesuch_inner2.dt < rangeend.dt
    WHERE nonesuch_inner1.dt IS NULL AND nonesuch_inner2.dt IS NULL
    

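This is a working implementation that uses SQL Server syntax for the common table expressions, but it should be easy to convert to DB2 syntax. I don't know how well it scales; to be honest, I have only tested it with a very small dataset.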

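Answer 3: (score: 0)

For what it's worth, this is the approach I ended up using. I think you could do this in pure SQL, but it becomes really ugly and hard to debug.

Step 1 - I merged adjacent date ranges within each of the two datasets. This means that something like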
ID, Start_Date, End_Date
1,  2010-01-01, 2010-01-31
1,  2010-02-01, 2010-02-28

became this -

ID, Start_Date, End_Date
1,  2010-01-01, 2010-02-28

The query I used to produce this was -

WITH Cte_recomb (Id, Start_date, End_date, Hopcount) AS
        (SELECT Id,
                Start_date,
                End_date,
                1 AS Hopcount
         FROM Table1
         UNION ALL
         SELECT Cte_recomb.Id,
                Cte_recomb.Start_date,
                Table1.End_date,
                (Cte_recomb.Hopcount + 1) AS Hopcount
         FROM Cte_recomb, Table1
         WHERE (Cte_recomb.Id = Table1.Id) AND
               (Cte_recomb.End_date + 1 day = Table1.Start_date)),
     Cte_maxenddate AS
        (SELECT Id,
                Start_date,
                Max (End_date) AS End_date
         FROM Cte_recomb
         GROUP BY Id, Start_date
         ORDER BY Id, Start_date)
SELECT Maxend.*
FROM    Cte_maxenddate AS Maxend
     LEFT JOIN
        Cte_recomb AS Nextrec
     ON (Nextrec.Id = Maxend.Id) AND
        (Nextrec.Start_date < Maxend.Start_date) AND
        (Nextrec.End_date >= Maxend.End_date)
WHERE Nextrec.Id IS NULL;
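The effect of this merge step can be sketched outside SQL as a single pass over sorted ranges. This is a minimal Python illustration for one Id, assuming inclusive (start, end) pairs; `merge_adjacent` is a hypothetical name. Unlike the recursive query, which only joins ranges that start exactly one day after another ends, this sketch also merges overlapping ranges.

```python
from datetime import date, timedelta

def merge_adjacent(ranges):
    """Merge (start, end) ranges that touch (next start = previous end + 1 day) or overlap."""
    merged = []
    for start, end in sorted(ranges):
        if merged and start <= merged[-1][1] + timedelta(days=1):
            # Continues the previous range: extend its end date if needed.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged

ranges = [(date(2010, 1, 1), date(2010, 1, 31)),
          (date(2010, 2, 1), date(2010, 2, 28))]
print(merge_adjacent(ranges))
```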

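Step 2 -

I built another dataset containing one record for every overlap between the two datasets. You need an additional step to find the cases where a given record in Table1 has no corresponding record in Table2 at all.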

SELECT Table1.Id,
       Table1.Start_date AS Table1_start_date,
       Table1.End_date AS Table1_end_date,
       Table2.Start_date AS Table2_start_date,
       Table2.End_date AS Table2_end_date
FROM    Table1
     INNER JOIN
        Table2
     ON (Table1.Id = Table2.Id) AND
        ( (Table1.Start_date BETWEEN Table2.Start_date AND Table2.End_date) OR
         (Table2.Start_date  BETWEEN Table1.Start_date AND Table1.End_date)) AND
        ( (Table1.Start_date <> Table2.Start_date) OR
         (Table1.End_date    <> Table2.End_date))
ORDER BY Table1.Id, Table1.Start_date, Table2.Start_date;

Step 3 -

I took the dataset above and ran the following SAS job. I tried doing this in pure SQL with a recursive query, but it got uglier every time I looked at it.

Data Table1_Gaps;
  Set Table1_Compare;
  By ID Table1_Start_Date Table2_Start_Date;
  format Gap_Start_Date yymmdd10.;
  format Gap_End_Date   yymmdd10.;
  format Old_Start_Date yymmdd10.;
  format Old_End_Date   yymmdd10.;
  Retain Old_Start_Date Old_End_Date;
  IF (Table2_End_Date = .) then do;
      Gap_Start_Date = Table1_Start_Date;
      Gap_End_Date   = Table1_End_Date;
      output;
  end;
  else do;
    If (Table2_Start_Date > Table1_Start_Date) then do;
      if first.Table1_Start_Date then do;
        Gap_Start_Date = Table1_Start_Date;
        Gap_End_Date   = Table2_Start_Date - 1;
        output;
      end;
      else do;
        Gap_Start_Date = Old_End_Date + 1;
        Gap_End_Date   = Table2_Start_Date - 1;
        output;
      end;
    end;
    If (Table2_End_Date < Table1_End_Date) then do;
      if Last.Table1_Start_Date then do;
        Gap_Start_Date = Table2_End_Date + 1;
        Gap_End_Date   = Table1_End_Date;
        output;
      end;
    end;
  end;
  Old_Start_Date = Table2_Start_Date;
  Old_End_Date   = Table2_End_Date;
  drop Old_Start_Date Old_End_Date;
run;

I haven't fully verified it yet, but this approach does seem to give me the results I want. Any thoughts?
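One way to cross-check the output is a single sweep over both sorted range lists, which avoids expanding ranges into individual days. The sketch below is Python rather than SAS or SQL and handles one ID at a time; it assumes inclusive (start, end) pairs that never overlap within a list, as stated in the question. `find_gaps` is a hypothetical name, not something from the steps above.

```python
from datetime import date, timedelta

ONE_DAY = timedelta(days=1)

def find_gaps(ranges1, ranges2):
    """Return the parts of ranges1 that are not covered by any range in ranges2."""
    cover = sorted(ranges2)
    gaps = []
    j = 0
    for start, end in sorted(ranges1):
        # Skip covering ranges that end before this range starts.
        while j < len(cover) and cover[j][1] < start:
            j += 1
        cursor = start  # first day not yet known to be covered
        k = j
        while k < len(cover) and cover[k][0] <= end:
            if cover[k][0] > cursor:
                gaps.append((cursor, cover[k][0] - ONE_DAY))
            cursor = max(cursor, cover[k][1] + ONE_DAY)
            k += 1
        if cursor <= end:
            gaps.append((cursor, end))
    return gaps

# Data for ID 1 from the question.
one = [(date(2010, 1, 1), date(2010, 1, 9)),
       (date(2010, 1, 10), date(2010, 1, 19)),
       (date(2010, 1, 30), date(2010, 1, 31))]
two = [(date(2010, 1, 1), date(2010, 1, 4)),
       (date(2010, 1, 8), date(2010, 1, 17)),
       (date(2010, 1, 30), date(2010, 1, 31))]

for gap_start, gap_end in find_gaps(one, two):
    print(gap_start, gap_end)
```

Gaps are reported per range of the first list, so a gap that spans two adjacent ranges comes out as two rows unless the ranges are merged first, as in Step 1.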