Question

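I am trying to return the latest record for each store, based on the TIMESTAMPTZ set at import time. I'm on Postgres 9.5, and this is the query I pieced together from a few Stack Overflow threads: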

select p.*
from store_products p
inner join(
   select storeid, sku, max(lastupdated) AS lastupdated
   from store_products
   group by storeid, sku
) sp on p.storeid = sp.storeid and p.lastupdated = sp.lastupdated

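This gives me the latest products for each store (and SKU), which is great (we have about 30 stores), but I noticed the query takes 4 minutes to gather the data (for 6M records).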

So, if we take this as my data:

PID | StoreID | SKU | lastupdated
1   | 1       | 1a1 | 2017-02-02 18:22:30
2   | 1       | 1b1 | 2017-02-02 18:21:30
3   | 1       | 1a1 | 2017-01-16 11:22:30
4   | 2       | 1a1 | 2017-02-02 18:21:30
5   | 2       | 1a1 | 2017-02-01 18:21:00
6   | 3       | 1a1 | 2017-02-02 18:21:30
7   | 3       | 1g1 | 2017-02-01 18:21:30

I get this:

PID | StoreID | SKU | lastupdated
1   | 1       | 1a1 | 2017-02-02 18:22:30
2   | 1       | 1b1 | 2017-02-02 18:21:30
4   | 2       | 1a1 | 2017-02-02 18:21:30
6   | 3       | 1a1 | 2017-02-02 18:21:30

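Is there a better way to import these store snapshots so that the query above is easier for Postgres to digest, i.e. faster? Should we add any indexes? Here is the EXPLAIN output: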

Hash Join  (cost=2358424.92..2715814.08 rows=311 width=371)
  Hash Cond: ((lp.storeid = p.storeid) AND (lp.lastupdated = p.lastupdated))
  ->  Subquery Scan on lp  (cost=1676046.30..1737513.85 rows=62125 width=12)
        ->  GroupAggregate  (cost=1676046.30..1736892.60 rows=62125 width=108)
              Group Key: store_products.storeid, store_products.sku
              ->  Sort  (cost=1676046.30..1691102.56 rows=6022505 width=108)
                    Sort Key: store_products.storeid, store_products.sku
                    ->  Seq Scan on store_products  (cost=0.00..297973.05 rows=6022505 width=108)
  ->  Hash  (cost=297973.05..297973.05 rows=6022505 width=371)
        ->  Seq Scan on store_products p  (cost=0.00..297973.05 rows=6022505 width=371)

Our Postgres DBA is on vacation, and most of us don't know what to do.

Backstory...

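We dump store products from multiple stores every day, as JSON. Each store is identified by storeid, and everything is imported as one big JSON file containing all stores and their products. Each entry has its own lastupdated TIMESTAMPTZ field. If someone updates a row later (for auditing purposes), a trigger fires that automatically updates this field. About 2-3K store_products rows are inserted into this table every day, and we currently do no deduplication on this data (so a price may or may not have changed; we don't seem to care yet, we just insert). I suppose we will be revisiting that soon.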
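A trigger of the kind described above usually looks something like the following sketch. The function and trigger names are hypothetical and are not taken from the actual schema; only the table and the lastupdated column come from the question.

```sql
-- Hypothetical names: touch_lastupdated, store_products_touch_lastupdated.
-- Sets lastupdated to the current time whenever a row is updated.
CREATE OR REPLACE FUNCTION touch_lastupdated() RETURNS trigger AS $$
BEGIN
    NEW.lastupdated := now();
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER store_products_touch_lastupdated
    BEFORE UPDATE ON store_products
    FOR EACH ROW
    EXECUTE PROCEDURE touch_lastupdated();
```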

Let me give you a basic schema:

CREATE TABLE store_products
(
    id BIGINT DEFAULT PRIMARY KEY NOT NULL,
    storeid INTEGER,
    ...etc etc...
    lastupdated TIMESTAMP WITH TIME ZONE DEFAULT now()
);

There is an FK to the stores table, etc.

Answer 1

Try using ROW_NUMBER() with an OVER (PARTITION BY ...) clause and a derived table, as below:

select *
from (
    select
        p.*,
        -- number the rows of each (storeid, sku) group, newest first
        ROW_NUMBER() OVER (PARTITION BY p.storeid, p.sku ORDER BY p.lastupdated DESC) AS rowno
    from store_products p
) temp
-- keep only the newest row of each group
where temp.rowno = 1
order by temp.storeid, temp.sku

Answer 2

DISTINCT ON would be simpler:

select distinct on (storeid, sku) *
from store_products
order by storeid, sku, lastupdated desc

Note that the ORDER BY clause is required; it determines which row of each group is returned.
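To make that concrete: DISTINCT ON keeps the first row of each (storeid, sku) group in ORDER BY order. If two rows in a group can share the same lastupdated, the row that wins is otherwise arbitrary, so a tie-breaker helps. A minimal sketch, assuming id is an acceptable tie-breaker (that is an assumption, not something the question states):

```sql
-- Newest row per (storeid, sku); on equal timestamps the highest id wins.
select distinct on (storeid, sku) *
from store_products
order by storeid, sku, lastupdated desc, id desc;
```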

Create an index on (storeid, sku, lastupdated), or on just (storeid, sku) if there are not enough timestamps per pair to be worth the extra index size.
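As a rough sketch, the index could be created like this. The index name is made up, and declaring lastupdated as DESC is my own choice so that the index order matches the ORDER BY of the DISTINCT ON query exactly; whether the planner actually uses it depends on the data, so it is worth checking the plan afterwards.

```sql
-- Hypothetical index name; columns are the ones used in the query above.
CREATE INDEX store_products_storeid_sku_lastupdated_idx
    ON store_products (storeid, sku, lastupdated DESC);

-- Re-check the plan and the real timing after creating the index.
EXPLAIN ANALYZE
select distinct on (storeid, sku) *
from store_products
order by storeid, sku, lastupdated desc;
```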

