How do I join each record to the most recent matching record?

Asked: 2014-01-20 08:16:01

Tags: sql performance sas

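I have two tables. Table A contains daily information on corporate bond transactions from 2004 to 2012, and Table B contains bond rating information as of specific dates. I need to join the two tables so that each transaction in Table A gets the most recent rating for that particular bond attached.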

Table A: daily_transactions
--------------------------------------------
DATE        |BOND    |PRICE
--------------------------------------------
20110401    |AES     |100
20110402    |AES     |101
20110403    |AES     |102
20110404    |AES     |103
20110401    |BPP     |99
20110402    |BPP     |98


Table B: bond_ratings
--------------------------------------------
DATE        |BOND    |RATING
--------------------------------------------
20110401    |AES     |AAA
20110403    |AES     |BB
20110401    |BPP     |CCC


Table C: joined_data
--------------------------------------------
DATE        |BOND    |PRICE   |RATING
--------------------------------------------
20110401    |AES     |100     |AAA
20110402    |AES     |101     |AAA
20110403    |AES     |102     |BB
20110404    |AES     |103     |BB
20110401    |BPP     |99      |CCC
20110402    |BPP     |98      |CCC

Table A has approximately 1,000,000 records and Table B has 14,000 records.

Update

What I have so far is:

create table test_merge as
SELECT a.date, b.date, a.bond, a.price, b.rating
FROM   daily_transactions  a
LEFT   JOIN bond_ratings b ON a.bond = b.bond AND b.date <= a.date
WHERE  NOT EXISTS (
   SELECT 1 FROM bond_ratings b1
   WHERE  b1.bond = a.bond
   AND b1.date <= a.date
   AND b1.date >  b.date
   );

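It seems to work fine (http://sqlfiddle.com/#!3/d287f/2), but because of the amount of data I have, it runs very slowly: it takes about 2 hours. Is there a way to optimize it to run faster?

I am very (very) new to SQL, so any help is much appreciated!

3 Answers:

Answer 0: (score: 1)

For a more SAS-based approach (rather than SQL), you can build a SAS format from Table B, which may speed things up. A format in SAS is just a lookup table that maps anything between START and END to LABEL. For example, loading this table as a format: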

fmtname   |  START       | END         | LABEL
-----------------------------------------------------------
$bondRate |  AES20110401 | AES20110403 | AAA

maps any text string between START and END to LABEL. So AES20110402 -> AAA.
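To make the range lookup concrete, here is a minimal sketch that hard-codes the single range from the table above with a VALUE statement instead of loading it from a dataset. The format name $demoRate, the 'NONE' label, and the two test keys are made up for illustration; they are not part of the original answer.

```sas
/* Hypothetical one-range character format; mirrors the START/END/LABEL row above */
proc format;
    value $demoRate
        'AES20110401' - 'AES20110403' = 'AAA'    /* any key inside the range maps to AAA */
        other                         = 'NONE';  /* keys outside every range */
run;

/* Look up one key inside the range and one outside it, and write both to the log */
data _null_;
    length key $ 11 rating $ 4;
    do key = 'AES20110402', 'AES20110405';
        rating = put(key, $demoRate.);
        put key= rating=;
    end;
run;
```

The comparison is a plain character comparison, which is why the bond name and an 8-digit date are concatenated into a single key.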

Here is the full code, using Table B from above (this assumes DATE is a numeric SAS date; if it is not, use input(DATE, YYMMDD8.) to convert it to one):

PROC SORT DATA = TABLE_B;
    by bond descending date;
run;

/*Use lag function to get the start and end date on one line*/
data bond_ratings_fmt;
    set TABLE_B;
    by bond descending date;

    START_DT = put(date,yymmddn8.);*Character date like '20110401';
    END_DT = put(lag(date)-1,yymmddn8.);* 1 day before the start of the next (later) rating;
    *first.bond is the most recent rating for each bond;
    *setting the END_DT to some future date in this case.;
    if first.bond then END_DT= '20991231';

    START = cats(BOND,START_DT);*Cats concatenates and trims spaces, makes AES20110401;
    END = cats(BOND,END_DT);
    LABEL = Rating;
    fmtName='$bondRate';    
run;
*Load the format, using CNTLIN (Control Table In);
proc format cntlin=bond_ratings_fmt;
run;

*Apply the format;
data TableC_withRating (drop=_:);
    set TableA;
    _DateChar = put(DATE,yymmddn8.);
    *cats (not ||) so that trailing blanks in BOND do not break the key;
    Rating = put(cats(BOND,_DateChar),$bondRate.);
run;

You can get fancier by adding an OTHER case to the format; there are plenty of good examples online of using cntlin with proc format.
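As a rough sketch of what an OTHER case could look like here: in a CNTLIN dataset, a row with HLO='O' is treated as the OTHER range. The dataset names other_row and bond_ratings_fmt_all and the label 'NR' are assumptions for illustration, and the sketch assumes the bond_ratings_fmt dataset built above already exists and that its LABEL variable is at least 2 characters long.

```sas
/* One extra control row: HLO='O' marks it as the OTHER range */
data other_row;
    length fmtName $ 9 START END $ 16 LABEL $ 8 HLO $ 1;
    fmtName = '$bondRate';
    START   = ' ';
    END     = ' ';
    LABEL   = 'NR';   /* hypothetical label for "no rating found" */
    HLO     = 'O';
run;

/* Append the OTHER row to the ranges and reload the format */
data bond_ratings_fmt_all;
    set bond_ratings_fmt other_row;
run;

proc format cntlin=bond_ratings_fmt_all;
run;
```

Without an OTHER row, a key that matches no range simply comes back unchanged from put(), so transactions dated before a bond's first rating would show the raw key instead of a rating.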

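Answer 1: (score: 0)

I suspect that in your case the subquery is what kills performance.

The following approach avoids the subquery, which makes the join more efficient.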

/*sample data:*/
DATA daily_transactions;
infile datalines dsd delimiter = '|';/*infile must come before input so the delimiter applies*/
informat date yymmdd8.;
format date yymmddn8.;
input date bond $ price;
datalines;
20110401|AES|100
20110402|AES|101
20110403|AES|102
20110404|AES|103
20110401|BPP|99
20110402|BPP|98
;
run;

DATA bond_ratings;
infile datalines dsd delimiter = '|';/*infile must come before input so the delimiter applies*/
informat date yymmdd8.;
format date yymmddn8.;
input date bond $ rating $;
datalines;
20110401|AES |AAA
20110403|AES |BB
20110401|BPP |CCC
;
run;

/*Modify the bond_ratings dataset such that for each record we can specify up till when that rating is valid*/
/*essentially we will have two date fields (from_date, to_date)
from_date   bond    rating  to_date
20110401       AES      AAA     20110402
20110403       AES      BB           .
20110401       BPP      CCC          .
*/

/*since there is no LEAD function in SAS, we sort in descending order by date and apply the LAG function - in effect getting the leading value*/
PROC SORT DATA = bond_ratings OUT = bond_ratings_sorted;
by bond descending date;
run; 
/*capture the to_date by using lag function on the date.*/
data bond_ratings_lookup(rename = (date=from_date));
set bond_ratings_sorted;
by bond descending date;
format to_date yymmddn8.;
lag_date = lag(date);/*note: LAG is called outside the if-else block on purpose: it returns the value from its previous call, so it must execute on every row*/
if first.bond then to_date =.;/*most recent rating for the bond: no end date*/
else to_date=lag_date-1;/*-1, so that to_date is one day before the next available bond rating date*/
drop lag_date;
run;
/*this sort is not necessary, but it is useful if you just want to verify the output*/
proc sort data = bond_ratings_lookup out = bond_ratings_lookup_sorted;
by bond from_date;
run;

/*final query:*/
proc sql;
create table joined as 
select a.*, b.rating, b.from_date as bond_rating_start_period, b.to_date as bond_rating_end_period
from daily_transactions as a 
left join bond_ratings_lookup_sorted as b
on a.bond = b.bond and
(
b.to_date  ne . and (a.date >=b.from_date and a.date<= b.to_date )
or
b.to_date  = . and (a.date >=b.from_date )
)
order by a.bond, a.date, b.from_date
;
quit;

Answer 2: (score: 0)

I managed to cut the run time down to 5 minutes by creating an index on the bond column.

proc sql;
   create index bond
      on work.daily_transactions(bond);
quit;

proc sql;
   create index bond
      on work.bond_ratings(bond);
quit;
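
A possible follow-up, offered only as a sketch: the MSGLEVEL=I system option makes SAS write notes to the log about which index (if any) it used, which helps confirm that the indexes above are being picked up. The composite index below, with the made-up name bond_date, is an untested assumption; whether it improves on the simple index for this join would have to be measured on the real data.

```sas
/* Ask SAS to log informational notes, including index usage */
options msglevel=i;

/* Hypothetical composite index on the two columns used in the join condition */
proc sql;
   create index bond_date
      on work.bond_ratings(bond, date);
quit;
```

After creating it, rerun the join and check the log for notes about index selection before deciding to keep it.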