Performance improvement in a query

Time: 2017-03-14 10:37:55

Tags: sql google-bigquery

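I have the following query, which does a self-join on the Locations table. When I run it on one million records, it takes more than 2 hours to execute. I would really appreciate any changes to this query that reduce the execution time.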

SELECT
    a.Id1, a.Id2, a.LocationStart, a.LocationEnd
FROM
    Locations AS a
JOIN
    Locations AS b
ON
    a.Id1= b.Id1 AND a.Id2 = b.Id2
WHERE
    a.DateTime = (
        SELECT
            MIN(DateTime)
        FROM
            Locations
        WHERE
            Id1 = a.Id1
            AND Id2 = a.Id2)

1 answer:

Answer 0 (score: 1)

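I would observe that your query doesn't really make sense as written: it joins Locations to itself but never uses any column from b. I assume it is oversimplified, so I'll include columns from both table references.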

I would start by using a window function:

SELECT l.Id1, l.Id2, l2.id1, l2.id2, l.LocationStart, l.LocationEnd
FROM (SELECT l.*,
             ROW_NUMBER() OVER (PARTITION BY id1, id2 ORDER BY datetime ASC) as seqnum
      FROM Locations l
     ) l JOIN
    Locations l2
    ON l.Id1 = l2.Id1 AND l.Id2 = l2.Id2 AND l.seqnum = 1;

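This assumes that you are looking for unique values from the first table, i.e. that there are no duplicate datetimes within an (Id1, Id2) pair. ROW_NUMBER() keeps exactly one row per pair, whereas the original MIN(DateTime) comparison would keep every row tied for the earliest value.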
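If ties on the earliest datetime do need to be kept, as in the original MIN(DateTime) comparison, one option is to swap ROW_NUMBER() for RANK(). The following is only a sketch in BigQuery standard SQL: the inline sample rows and the names rnk and ranked are made up for illustration, and the join is left out so the tie handling is easier to see.

```sql
-- Sketch only: the rows below are invented sample data, not from the question.
WITH Locations AS (
  SELECT 1 AS Id1, 1 AS Id2, 'A' AS LocationStart, 'B' AS LocationEnd,
         DATETIME '2017-03-01 08:00:00' AS DateTime UNION ALL
  SELECT 1, 1, 'A', 'C', DATETIME '2017-03-01 08:00:00' UNION ALL  -- tie on the earliest DateTime
  SELECT 1, 1, 'C', 'D', DATETIME '2017-03-02 09:00:00' UNION ALL
  SELECT 2, 1, 'E', 'F', DATETIME '2017-03-05 10:00:00'
)
SELECT Id1, Id2, LocationStart, LocationEnd
FROM (
  SELECT l.*,
         -- RANK() gives every row tied for the earliest DateTime the value 1,
         -- while ROW_NUMBER() would pick just one of them arbitrarily.
         RANK() OVER (PARTITION BY Id1, Id2 ORDER BY DateTime ASC) AS rnk
  FROM Locations l
) AS ranked
WHERE rnk = 1;
```

With this sample data, both tied rows for the pair (1, 1) are returned together with the single row for (2, 1); with ROW_NUMBER() only one of the two tied rows would survive.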

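Next, I would observe that you only want the first value of the fields from the first table reference (l). And guess what? You don't need the join at all: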

select first_value(l.id1) over (partition by id1, id2 order by datetime),
       first_value(l.id2) over (partition by id1, id2 order by datetime),
       l.id1,
       l.id2,
       first_value(l.locationstart) over (partition by id1, id2 order by datetime),
       first_value(l.locationend) over (partition by id1, id2 order by datetime)    
from locations l;
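Note that the query above still returns one row per row of locations, because window functions do not collapse rows. If what is wanted is a single row per (Id1, Id2) pair, one possible alternative in BigQuery standard SQL is to aggregate with ARRAY_AGG ... ORDER BY ... LIMIT 1. This is a different technique from the answer's first_value approach, shown here only as a sketch: the sample rows and the alias first_row are assumptions for illustration.

```sql
-- Sketch only: the rows below are invented sample data, not from the question.
WITH Locations AS (
  SELECT 1 AS Id1, 1 AS Id2, 'A' AS LocationStart, 'B' AS LocationEnd,
         DATETIME '2017-03-01 08:00:00' AS DateTime UNION ALL
  SELECT 1, 1, 'C', 'D', DATETIME '2017-03-02 09:00:00' UNION ALL
  SELECT 2, 1, 'E', 'F', DATETIME '2017-03-05 10:00:00'
)
SELECT first_row.*
FROM (
  SELECT
    -- Keep only the earliest row of each group as a STRUCT of the whole row.
    ARRAY_AGG(l ORDER BY l.DateTime ASC LIMIT 1)[OFFSET(0)] AS first_row
  FROM Locations l
  GROUP BY l.Id1, l.Id2
);
```

This needs no self-join and no correlated subquery, so the table is read once. Whether it actually runs faster on a million rows depends on the data, so it is worth comparing against the window-function version.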