How to find the nearest neighbor in Teradata SQL?

Time: 2018-10-29 12:32:13

Tags: sql teradata knn

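For each customer in table TG, I want to find the single nearest neighbor. The decision must be based on a distance computed over the val columns.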

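One possible solution is to cross join the table with itself, but `TG` has 100k rows and the initial table has 50 million, so the intermediate result would be huge (on the order of 5 × 10^12 pairs). So I came up with the idea of using window functions.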

I cannot get the following attempt to work. What should I do instead?

SELECT 
   cust_id2,
   MIN(CASE WHEN cust_id <> cust_id2 THEN cust_id END) -- for each cust_id2 from TG, pick another cust_id from the all_custs table
   OVER (PARTITION BY cust_id2 
   ORDER BY SQRT(POWER(cur.val1 - pref.val1, 2) + POWER(cur.val2 - pref.val2, 2))) -- here I want to order by distance, but I need both the current value and the previous one; nested window functions are not allowed
FROM 
(
   select all_custs.cust_id, val1, val2,  aa.cust_id2 from all_custs
   left join (sel cust_id as cust_id2 from TG) aa on aa.cust_id2 = all_custs.cust_id 
) AS dt
where cust_id2 is not null


`TG` stores only ids, as numbers. Every customer in `TG` is also in `all_custs`.

Table all_custs 

cust_id (number) | val1(decimal) | val2(decimal)
_________________|_______________|_____________
123123131        | 123.1         | 2
234234241        | 75.15         | 5 
525165354        | 676.12        | 3

For cust_id = 123123131, the nearest neighbor would be cust_id 234234241. There may be more than two val columns.
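As a quick check on the three sample rows, here is the arithmetic for customer 123123131, assuming the plain (unnormalized) Euclidean distance over val1 and val2 that the first query uses:

```latex
\begin{aligned}
d(123123131,\ 234234241) &= \sqrt{(123.1 - 75.15)^2 + (2 - 5)^2} = \sqrt{2299.2025 + 9} \approx 48.04 \\
d(123123131,\ 525165354) &= \sqrt{(123.1 - 676.12)^2 + (2 - 3)^2} = \sqrt{305831.1204 + 1} \approx 553.02
\end{aligned}
```

So 234234241 is the closer of the two candidates.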

UPD1: For reference, this is how it could be done with a cross join, although it should not be done this way:

sel tg.cust_id as tg_cust_id, cg.cust_id as cg_cust_id,
SQRT   (
            POWER((tg.val1 - cg.val1)/max_val1, 2) -- min = 0
          + POWER((tg.val2 - cg.val2)/max_val2, 2) -- min = 0 
       ) AS DIST
   from (
       sel arpau.cust_id, val1, val2
       from all_custs arpau join TG aa on aa.cust_id2 = arpau.cust_id
       where arpau.branch_id = 95 
   ) tg
   join ( 
    sel arpau.cust_id, val1, Max(val1)  over(ROWS UNBOUNDED PRECEDING) max_val1
                     , val2, Max(val2)  over(ROWS UNBOUNDED PRECEDING) max_val2
            from all_custs arpau left join TG aa on aa.cust_id2 = arpau.cust_id
       where aa.cust_id2 is null       
    ) cg  on  tg.cust_id <> cg.cust_id
QUALIFY ROW_NUMBER() OVER(PARTITION BY tg.cust_id ORDER BY DIST) = 1
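One possible way to shrink the number of candidate pairs, offered only as a rough sketch and not as a tested solution, is to bucket the val columns into grid cells and compare customers only within the same or adjacent cells. The cell widths (100 for val1, 5 for val2) are hypothetical and would need tuning to the data, and the sketch assumes the id column in `TG` is named `cust_id`. The result is approximate: a true nearest neighbor outside the 3x3 block of cells is missed, and a TG customer with no candidate in its block gets no row at all.

```sql
-- Sketch only: grid-cell blocking instead of a full cross join.
-- Assumptions: TG.cust_id is the id column; cell widths 100 and 5 are hypothetical.
SELECT t.cust_id AS tg_cust_id,
       c.cust_id AS nn_cust_id,
       SQRT(POWER(t.val1 - c.val1, 2) + POWER(t.val2 - c.val2, 2)) AS dist
FROM (
    -- TG customers with their values and grid cell
    SELECT a.cust_id, a.val1, a.val2,
           FLOOR(a.val1 / 100) AS cell1,
           FLOOR(a.val2 / 5)   AS cell2
    FROM all_custs a
    JOIN TG g ON g.cust_id = a.cust_id
) t
JOIN (
    -- all customers as neighbor candidates, with their grid cell
    SELECT cust_id, val1, val2,
           FLOOR(val1 / 100) AS cell1,
           FLOOR(val2 / 5)   AS cell2
    FROM all_custs
) c
  ON  c.cell1 BETWEEN t.cell1 - 1 AND t.cell1 + 1   -- same or adjacent cell on val1
  AND c.cell2 BETWEEN t.cell2 - 1 AND t.cell2 + 1   -- same or adjacent cell on val2
  AND c.cust_id <> t.cust_id                        -- never match a customer to itself
-- keep only the closest candidate per TG customer
QUALIFY ROW_NUMBER() OVER (PARTITION BY t.cust_id ORDER BY dist) = 1;
```

Whether this actually runs faster than the cross join depends on how the optimizer handles the range conditions in the join, so it should be checked with EXPLAIN on real data before relying on it.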

0 answers:

No answers