How to optimize this Postgres count query

Time: 2013-07-03 05:56:42

Tags: postgresql query-optimization

EXPLAIN ANALYZE 
SELECT count(*) 
FROM "businesses" 
WHERE (
    source = 'facebook' 
    OR EXISTS( 
        SELECT * 
        FROM provider_business_map pbm 
        WHERE 
            pbm.hotstepper_business_id=businesses.id 
            AND pbm.provider_name='facebook' 
    )
);
                                                                                 QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=233538965.74..233538965.75 rows=1 width=0) (actual time=116169.720..116169.721 rows=1 loops=1)
   ->  Seq Scan on businesses  (cost=0.00..233521096.48 rows=7147706 width=0) (actual time=11.284..116165.646 rows=3693 loops=1)
         Filter: (((source)::text = 'facebook'::text) OR (alternatives: SubPlan 1 or hashed SubPlan 2))
         SubPlan 1
           ->  Index Scan using idx_provider_hotstepper_business on provider_business_map pbm  (cost=0.00..16.29 rows=1 width=0) (never executed)
                 Index Cond: (((provider_name)::text = 'facebook'::text) AND (hotstepper_business_id = businesses.id))
         SubPlan 2
           ->  Index Scan using idx_provider_hotstepper_business on provider_business_map pbm  (cost=0.00..16.28 rows=1 width=4) (actual time=0.045..5.685 rows=3858 loops=1)
                 Index Cond: ((provider_name)::text = 'facebook'::text)
 Total runtime: 116169.820 ms
(10 rows)

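This query takes over a minute (about 116 seconds in the plan above) and returns a count of roughly 3,000. The bottleneck appears to be the sequential scan on businesses, but I'm not sure which indexes the database needs to avoid it. It's also worth noting that I haven't tuned Postgres at all, so if any configuration changes might help, they're worth considering. That said, my database is 15GB and I don't plan to fit all of it in memory any time soon, so I'm not sure how much changing the RAM-related settings would help.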

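3 answers:

Answer 0 (score: 2)

OR is notorious for performing badly. Try splitting the query into a union of two completely separate queries, one on each table: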

SELECT COUNT(*) FROM (
    SELECT id
    FROM businesses 
    WHERE source = 'facebook'
    UNION   -- union makes the ids unique in the result
    SELECT hotstepper_business_id
    FROM provider_business_map
    WHERE provider_name = 'facebook'
    AND hotstepper_business_id IS NOT NULL
) x

If hotstepper_business_id cannot be null, you can remove the line

AND hotstepper_business_id IS NOT NULL

If you want the whole business rows, you can simply wrap the above query in IN (...):

SELECT * FROM businesses
WHERE ID IN (
    -- above inner query
)
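
Spelled out in full, with the inner query from above substituted for the placeholder comment, the statement might look like the sketch below. It only reuses the table and column names from the question; keep or drop the IS NOT NULL line as discussed above.

```sql
-- Fetch whole rows from businesses whose id appears in either source of
-- 'facebook' ids. UNION de-duplicates the ids before the IN lookup.
SELECT *
FROM businesses
WHERE id IN (
    SELECT id
    FROM businesses
    WHERE source = 'facebook'
    UNION
    SELECT hotstepper_business_id
    FROM provider_business_map
    WHERE provider_name = 'facebook'
      AND hotstepper_business_id IS NOT NULL
);
```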

But a query that should perform much better is the one above modified to use a join:

SELECT *
FROM businesses 
WHERE source = 'facebook'
UNION
SELECT b.*
FROM provider_business_map m
JOIN businesses b
  ON b.id = m.hotstepper_business_id
WHERE provider_name = 'facebook'

Answer 1 (score: 1)

I would at least try rewriting the dependent subquery as:

SELECT COUNT(DISTINCT b.id)  -- assumes businesses.id is unique (e.g. the primary key)
FROM businesses b
LEFT JOIN provider_business_map pbm
  ON b.id=pbm.hotstepper_business_id
WHERE b.source = 'facebook'
  OR pbm.provider_name = 'facebook';

Unless I'm misreading something, an index on businesses.id will already exist, but make sure there are also indexes on provider_business_map.hotstepper_business_id, businesses.source, and provider_business_map.provider_name for best results.
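
As a rough sketch, the three indexes mentioned above could be created as follows. The index names are hypothetical. Note that the plan in the question already uses idx_provider_hotstepper_business, which, judging by its Index Cond, seems to cover provider_name and hotstepper_business_id, so the provider_name index may turn out to be redundant.

```sql
-- Index names are made up for illustration; use your own naming scheme.
CREATE INDEX idx_pbm_hotstepper_business_id
    ON provider_business_map (hotstepper_business_id);

CREATE INDEX idx_businesses_source
    ON businesses (source);

-- Possibly redundant if an existing index already leads with provider_name.
CREATE INDEX idx_pbm_provider_name
    ON provider_business_map (provider_name);

-- Refresh planner statistics so the new indexes are costed sensibly.
ANALYZE businesses;
ANALYZE provider_business_map;
```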

Answer 2 (score: 1)

create index index_name on businesses(source);

Since only 3,693 of more than 7 million rows match, the index will probably be used. Don't forget to run

analyse businesses;
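
One way to check whether the planner actually picks up the new index is to run EXPLAIN ANALYZE on just the source filter, as in the sketch below. This is only a diagnostic, not part of the answer's fix; the original OR ... EXISTS filter may still be planned as a sequential scan even with the index in place, so it is worth examining that plan separately.

```sql
-- Look for an index scan, index-only scan, or bitmap scan on businesses
-- in the output instead of "Seq Scan on businesses".
EXPLAIN ANALYZE
SELECT count(*)
FROM businesses
WHERE source = 'facebook';
```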