Question

I have the following table:

CREATE TABLE items (
  id serial,
  timestamp bigint,
  CONSTRAINT id_pkey PRIMARY KEY (id)
);

This table is append-only, so timestamp values increase with id. I need to find the row whose timestamp is closest to a given $value.

Query 1: this requires two full table scans.

SELECT id FROM
  (
      (
          SELECT id, timestamp
          FROM items
          WHERE timestamp < $value
          ORDER BY timestamp DESC
          LIMIT 1
      )
      UNION ALL
      (
          SELECT id, timestamp
          FROM items
          WHERE timestamp >= $value
          ORDER BY timestamp ASC
          LIMIT 1
      )
) AS tmp
ORDER BY abs($value - timestamp)
LIMIT 1

Query 2: this looks like it should be faster, but for some reason it is not.

SELECT id
FROM items
WHERE timestamp >= $value
ORDER BY id ASC
LIMIT 1

Query 3: I am experimenting with a custom aggregate. It needs a full table scan, but it does not need to sort anything or load any index.

create function closest_id_sfunc(
  agg_state bigint[2],
  id bigint,
  timestamp bigint,
  target_timestamp bigint
)
returns bigint[2]
immutable
language plpgsql
as $$
declare
  difference bigint;
begin
  difference := abs(timestamp - target_timestamp);
  if agg_state is null or difference < agg_state[0] then
    agg_state[0] = difference;
    agg_state[1] = id;
  end if;
  return agg_state;
end;
$$;

create function closest_id_finalfunc(agg_state bigint[2])
returns bigint
immutable
strict
language plpgsql
as $$
begin
  return agg_state[1];
end;
$$;

create aggregate closest_id (bigint, bigint, bigint)
(
    stype     = bigint[2],
    sfunc     = closest_id_sfunc,
    finalfunc = closest_id_finalfunc
);


SELECT closest_id(id, timestamp, $value) as id FROM items

Why is query 2 slower than query 1?

Answer 1

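Your second query will not work, because there may be a row before the provided timestamp that is closer to the provided value. Accuracy is not the only concern here: there may be no row at all at or after the provided timestamp (while rows with lower values exist), in which case the query returns nothing.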
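To make the first problem concrete, here is a small sketch with made-up data (the sample CTE, its column names, and its values are hypothetical, not taken from your table). With a target of 25, the logic of query 2 picks the row with timestamp 100, although the row with timestamp 20 is much closer:

```sql
-- Hypothetical sample rows: timestamps 10, 20 and 100; the target value is 25.
with sample(id, ts) as (
  values (1, 10), (2, 20), (3, 100)
)
select
  -- Query 2's logic: the first row at or after the target.
  (select id from sample where ts >= 25 order by id limit 1) as query2_id,
  -- The true nearest neighbour by absolute distance.
  (select id from sample order by abs(ts - 25) limit 1)      as nearest_id;
-- query2_id = 3, nearest_id = 2
```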

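Your first query looks efficient (since you use limit 1 in the subqueries). But yes, it needs two table scans when you have no index, and you cannot work around that. You need an index to get a huge performance boost. There are, however, a few tricks you can use.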

My initial thought was that you could avoid the sorting cost of the outer query by using a conditional expression:

(Note: I will use ts as the column name, because timestamp is a keyword and should not be used as a column name unless it is quoted.)

with l as (
  select   id, ts
  from     items
  where    ts < $value
  order by ts desc
  limit    1
),
g as (
  select   id, ts
  from     items
  where    ts >= $value
  order by ts asc
  limit    1
)
select    case
            when abs($value - l.ts) < abs($value - g.ts)
            then l.id
            else coalesce(g.id, l.id)
          end id
from      l
full join g on true

However, this gave only a tiny performance gain in my tests (it seems PostgreSQL is very smart about sorting just two rows).

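You can speed up the query by using a direct "distance" calculation with some of PostgreSQL's geometric types. Note: these types usually store their values as double precision, so they may introduce rounding errors. If your values are real unix timestamps (in bigint), this is unlikely to be a problem.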
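As a quick illustration of where that rounding starts to matter (a sketch, not part of the measurements in this answer): double precision represents integers exactly only up to 2^53, so values the size of unix timestamps round-trip unchanged, while much larger bigint values do not.

```sql
select
  -- A typical unix timestamp in seconds survives the round trip.
  (1500000000::bigint)::double precision::bigint       as unix_seconds,  -- 1500000000
  -- 2^53 + 1 cannot be represented exactly and is rounded.
  (9007199254740993::bigint)::double precision::bigint as above_2_53;    -- 9007199254740992
```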

Here is a query that uses the always-available point type with the <-> distance operator on point(ts, 0) (so the second coordinate is always zero):

select   id
from     items
order by point(ts, 0) <-> point($value, 0)
limit    1

In my tests, this takes about 70% of the time of the original query (or the CTE variant).

You can also use the cube module's cube type and the (Euclidean) distance operator <-> on cube(ts) (a 9.6+ feature), so the cube is always a one-dimensional point:

select   id
from     items
order by cube(ts) <-> cube($value)
limit    1

This is comparable to the point variant in speed. It does make some difference once you use indexes.

(Note: you can initialize the module with create extension cube;)

Indexes

So, the fun part:

Your original query (or the CTE variant) can use the following (covering) index:

create index idx_items_ts_id on items (ts, id)

With this, your original query (and the CTE variant) uses an index-only scan, which costs about 1.5% of the same query without an index.
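If you want to check the plan yourself, something like the following should work. This is only a sketch: 1500000000 is a placeholder for $value, and whether the planner actually picks an index-only scan also depends on the visibility map, so a recently vacuumed table helps.

```sql
-- Refresh statistics and the visibility map (cannot run inside a transaction block).
vacuum analyze items;

-- One half of the original query; the other half (ts < ..., order by ts desc) is analogous.
explain (analyze, buffers)
select   id
from     items
where    ts >= 1500000000
order by ts asc
limit    1;
-- Look for an "Index Only Scan using idx_items_ts_id" node in the output.
```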

The point variant can use the following GiST index:

(Note: the btree_gist module is needed for id to be part of the index. You can initialize the module with create extension btree_gist;)

create index idx_items_point_gist on items using gist (point(ts, 0), id)

With this, the point variant costs about 1% of the original query (without an index).

The cube variant can use the following GiST index:

(Note: this also requires the btree_gist module.)

create index idx_items_cube_gist on items using gist (cube(ts), id)

Again, this is comparable to the point variant.

Conclusion (see the later edit)

With point or cube (the latter requires 9.6+), you get the best performance. In addition, indexes can help you a lot.

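Additional notes:

The point variant is actually sometimes faster (than the cube variant).
PostgreSQL took a very long time to build the cube index, and I do not know why.
In theory, the cube index should be smaller, because it does not contain the unnecessary zeroes. However, because cubes are more general (N-dimensional), I may be wrong about this. I advise trying both and measuring (index size and performance).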

http://rextester.com/KNY52367 (the cube queries are in there too, but they will not run, because rextester currently uses 9.5.)

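Additionally, I also tested a custom aggregate solution (basically your version, sped up a bit by using language sql functions). It was still about 10 times slower than the original query. IMHO it is simply not worth it. http://rextester.com/PLG94853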

Edit: note that the btree_gist module adds support for the distance operator (<->) on basic types (such as bigint).

So this query outperforms (slightly) even the point and cube variants:

select   id
from     items
order by ts <-> $value
limit    1

This index is best suited to the query above:

create index idx_items_ts_gist on items using gist (ts, id)

http://rextester.com/XUF56126
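Putting the pieces of this edit together, a minimal end-to-end sketch could look like the following. The literal 1500000000 is a placeholder for $value, and the cast is there on the assumption that keeping both operands bigint is the safest way to match the btree_gist operator.

```sql
-- The distance operator on bigint comes from btree_gist.
create extension if not exists btree_gist;

create index if not exists idx_items_ts_gist on items using gist (ts, id);

-- If the planner uses the GiST index, the plan should contain an index scan
-- ordered by the <-> expression instead of a sort over the whole table.
explain
select   id
from     items
order by ts <-> 1500000000::bigint
limit    1;
```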
