Question

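I have a fairly complex SELECT statement built on the Chado schema that runs somewhat inefficiently. The query currently takes about 5 seconds on a subset of my data. The complete dataset will probably be more than a hundred times larger, and I am worried the computation time will become very slow. I have been advised to use indexing to improve performance, but I am not entirely sure what that would involve.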

My query:

SELECT dbxref.accession, dbxrefprop.value, contact.name, biomaterial.description, bp1.value AS isolation_source, bp2.value AS specimen_collection_date,
bp3.value AS collection_location_name, bp4.value AS genotype, refeat.seqlen, string_agg(feat.name, ', ' order by feat.name) AS tranlation_type, refeat.residues
FROM featureloc
INNER JOIN feature srcfeat ON srcfeat.feature_id = featureloc.srcfeature_id 
INNER JOIN feature feat ON feat.feature_id = featureloc.feature_id 
RIGHT JOIN dbxref ON dbxref.dbxref_id = srcfeat.dbxref_id
INNER JOIN feature refeat ON refeat.dbxref_id = dbxref.dbxref_id
INNER JOIN dbxrefprop ON dbxrefprop.dbxref_id = dbxref.dbxref_id 
INNER JOIN biomaterial ON biomaterial.dbxref_id = dbxref.dbxref_id 
INNER JOIN biomaterialprop bp1 ON (bp1.biomaterial_id = biomaterial.biomaterial_id and bp1.type_id = 2916)
INNER JOIN biomaterialprop bp2 ON (bp2.biomaterial_id = biomaterial.biomaterial_id and bp2.type_id = 2917)
INNER JOIN biomaterialprop bp3 ON (bp3.biomaterial_id = biomaterial.biomaterial_id and bp3.type_id = 2918)
INNER JOIN biomaterialprop bp4 ON (bp4.biomaterial_id = biomaterial.biomaterial_id and bp4.type_id = 2919)
INNER JOIN contact ON contact.contact_id = biomaterial.biosourceprovider_id
GROUP BY dbxref.accession, dbxrefprop.value, contact.name, biomaterial.description, bp1.value, bp2.value, bp3.value, bp4.value, refeat.seqlen, refeat.residues
HAVING (bp1.value = 'Alveolar Macrophage')
ORDER BY dbxref.accession;

EXPLAIN output (without the HAVING line):

GroupAggregate  (cost=627.81..631.98 rows=98 width=361)
   Group Key: dbxref.accession, dbxrefprop.value, contact.name, biomaterial.description, bp1.value, bp2.value, bp3.value, bp4.value, refeat.seqlen, refeat.residues
   ->  Sort  (cost=627.81..628.06 rows=98 width=361)
         Sort Key: dbxref.accession, dbxrefprop.value, contact.name, biomaterial.description, bp1.value, bp2.value, bp3.value, bp4.value, refeat.seqlen, refeat.residues
         ->  Hash Join  (cost=11.42..624.57 rows=98 width=361)
               Hash Cond: (biomaterial.biosourceprovider_id = contact.contact_id)
               ->  Nested Loop  (cost=3.11..614.92 rows=98 width=316)
                     ->  Nested Loop  (cost=2.83..563.01 rows=98 width=344)
                           ->  Nested Loop  (cost=2.54..511.11 rows=98 width=332)
                                 ->  Nested Loop  (cost=2.26..459.20 rows=98 width=320)
                                       ->  Nested Loop Left Join  (cost=1.98..407.30 rows=98 width=308)
                                             ->  Nested Loop  (cost=1.12..148.51 rows=98 width=309)
                                                   ->  Merge Join  (cost=0.84..88.36 rows=164 width=312)
                                                         Merge Cond: (refeat.dbxref_id = dbxrefprop.dbxref_id)
                                                         ->  Merge Join  (cost=0.56..188.11 rows=1400 width=296)
                                                               Merge Cond: (refeat.dbxref_id = biomaterial.dbxref_id)
                                                               ->  Index Scan using feature_idx1 on feature refeat  (cost=0.29..884.42 rows=11936 width=264)
                                                               ->  Index Scan using biomaterial_idx3 on biomaterial  (cost=0.28..63.28 rows=1400 width=32)
                                                         ->  Index Scan using dbxrefprop_idx1 on dbxrefprop  (cost=0.28..60.28 rows=1400 width=16)
                                                   ->  Index Scan using dbxref_pkey on dbxref  (cost=0.29..0.36 rows=1 width=21)
                                                         Index Cond: (dbxref_id = refeat.dbxref_id)
                                             ->  Nested Loop  (cost=0.86..2.63 rows=1 width=15)
                                                   ->  Nested Loop  (cost=0.57..2.06 rows=1 width=16)
                                                         ->  Index Scan using feature_idx1 on feature srcfeat  (cost=0.29..0.46 rows=1 width=16)
                                                               Index Cond: (dbxref.dbxref_id = dbxref_id)
                                                         ->  Index Scan using featureloc_idx2 on featureloc  (cost=0.29..1.14 rows=46 width=16)
                                                               Index Cond: (srcfeature_id = srcfeat.feature_id)
                                                   ->  Index Scan using feature_pkey on feature feat  (cost=0.29..0.56 rows=1 width=15)
                                                         Index Cond: (feature_id = featureloc.feature_id)
                                       ->  Index Scan using biomaterialprop_c1 on biomaterialprop bp1  (cost=0.28..0.52 rows=1 width=12)
                                             Index Cond: ((biomaterial_id = biomaterial.biomaterial_id) AND (type_id = 2916))
                                 ->  Index Scan using biomaterialprop_c1 on biomaterialprop bp2  (cost=0.28..0.52 rows=1 width=12)
                                       Index Cond: ((biomaterial_id = bp1.biomaterial_id) AND (type_id = 2917))
                           ->  Index Scan using biomaterialprop_c1 on biomaterialprop bp3  (cost=0.28..0.52 rows=1 width=12)
                                 Index Cond: ((biomaterial_id = bp1.biomaterial_id) AND (type_id = 2918))
                     ->  Index Scan using biomaterialprop_c1 on biomaterialprop bp4  (cost=0.28..0.52 rows=1 width=12)
                           Index Cond: ((biomaterial_id = bp1.biomaterial_id) AND (type_id = 2919))
               ->  Hash  (cost=5.36..5.36 rows=236 width=61)
                     ->  Seq Scan on contact  (cost=0.00..5.36 rows=236 width=61)

Based on my limited understanding of how to interpret EXPLAIN output, the following lines look suspect:

->  Index Scan using feature_idx1 on feature refeat  (cost=0.29..884.42 rows=11936 width=264)
->  Index Scan using biomaterial_idx3 on biomaterial  (cost=0.28..63.28 rows=1400 width=32)
->  Index Scan using dbxrefprop_idx1 on dbxrefprop  (cost=0.28..60.28 rows=1400 width=16)

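This demonstration (in MySQL) suggests that indexing involves assigning additional primary keys to make join lookups more efficient. Here are my proposed changes: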

ALTER TABLE feature
    ADD PRIMARY KEY (dbxref_id);
ALTER TABLE biomaterial
    ADD PRIMARY KEY (dbxref_id);
ALTER TABLE dbxrefprop
    ADD PRIMARY KEY (dbxref_id);

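Note that dbxref_id is a foreign key in all three tables, referencing the primary key of the dbxref table. Is this a valid way to improve computation time? Alternatively, instead of altering the tables, which lines of the query could be changed to improve it further? The inner-joined feature table with the "refeat" alias is necessary to prevent features from being omitted when linking through the featureloc table.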

EDIT1

The primary key of each joined table is as follows: featureloc = featureloc_id, feature = feature_id, dbxref = dbxref_id, dbxrefprop = dbxrefprop_id, biomaterial = biomaterial_id, biomaterialprop = biomaterialprop_id, contact = contact_id.
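
Since each of these tables already has a primary key, and PostgreSQL allows only one primary key per table, the ALTER TABLE ... ADD PRIMARY KEY statements proposed above would presumably be rejected. The plan also suggests that dbxref_id is already indexed: feature_idx1 is used with an index condition on dbxref_id, and biomaterial_idx3 and dbxrefprop_idx1 feed merge joins on dbxref_id. A minimal sketch of how this could be checked, and of what a plain secondary index would look like instead of a second primary key. The index name dbxrefprop_dbxref_id_idx is made up for illustration, and the CREATE INDEX is only relevant if the first query shows no existing index on that column:

```sql
-- List the indexes that already exist on the three tables in question.
-- pg_indexes is a standard PostgreSQL system view.
SELECT tablename, indexname, indexdef
FROM pg_indexes
WHERE tablename IN ('feature', 'biomaterial', 'dbxrefprop')
ORDER BY tablename, indexname;

-- Hypothetical secondary (non-unique) index on the foreign key column.
-- Assumption: only needed if no index on dbxref_id appears in the list above.
CREATE INDEX dbxrefprop_dbxref_id_idx ON dbxrefprop (dbxref_id);
```

Unlike a primary key, a secondary index of this kind does not require the column to be unique or NOT NULL.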

With a sample size of 1400, the row counts of the tables are as follows: featureloc = 10536, feature = 11936, dbxref = 15492, dbxrefprop = 1400, biomaterial = 1400, biomaterialprop = 5600, contact = 236. Note that some tables (dbxref) contain preloaded data.
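
For anyone wanting to reproduce these numbers, a sketch of two ways to collect row counts in PostgreSQL. The first reads the statistics collector's estimate (n_live_tup is approximate), the second counts exactly but scans the table:

```sql
-- Approximate live row counts from the statistics views.
SELECT relname, n_live_tup
FROM pg_stat_user_tables
WHERE relname IN ('featureloc', 'feature', 'dbxref', 'dbxrefprop',
                  'biomaterial', 'biomaterialprop', 'contact')
ORDER BY relname;

-- Exact count for a single table (slower on large tables).
SELECT count(*) AS feature_rows FROM feature;
```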

EDIT2

EXPLAIN (ANALYZE, BUFFERS) output:

GroupAggregate  (cost=522.10..526.26 rows=98 width=361) (actual time=7899.696..10201.445 rows=1400 loops=1)
   Group Key: dbxref.accession, dbxrefprop.value, contact.name, biomaterial.description, bp1.value, bp2.value, bp3.value, bp4.value, refeat.seqlen, refeat.residues
   Buffers: shared hit=320702 read=1752
   ->  Sort  (cost=522.10..522.34 rows=98 width=361) (actual time=7899.664..7940.350 rows=10606 loops=1)
         Sort Key: dbxref.accession, dbxrefprop.value, contact.name, biomaterial.description, bp1.value, bp2.value, bp3.value, bp4.value, refeat.seqlen, refeat.residues
         Sort Method: quicksort  Memory: 3708kB
         Buffers: shared hit=244651 read=1752
         ->  Hash Join  (cost=11.42..518.86 rows=98 width=361) (actual time=0.406..5525.245 rows=10606 loops=1)
               Hash Cond: (biomaterial.biosourceprovider_id = contact.contact_id)
               Buffers: shared hit=171364 read=847
               ->  Nested Loop  (cost=3.11..509.20 rows=98 width=316) (actual time=0.141..5201.920 rows=10606 loops=1)
                     Buffers: shared hit=171362 read=846
                     ->  Nested Loop  (cost=2.83..457.29 rows=98 width=344) (actual time=0.138..4258.617 rows=10606 loops=1)
                           Buffers: shared hit=139485 read=821
                           ->  Nested Loop  (cost=2.54..405.39 rows=98 width=332) (actual time=0.135..3082.229 rows=10606 loops=1)
                                 Buffers: shared hit=107577 read=803
                                 ->  Nested Loop  (cost=2.26..353.48 rows=98 width=320) (actual time=0.130..2179.420 rows=10606 loops=1)
                                       Buffers: shared hit=75654 read=786
                                       ->  Nested Loop Left Join  (cost=1.98..301.58 rows=98 width=308) (actual time=0.102..1566.105 rows=10606 loops=1)
                                             Buffers: shared hit=43748 read=773
                                             ->  Nested Loop  (cost=1.12..148.51 rows=98 width=309) (actual time=0.042..332.126 rows=1400 loops=1)
                                                   Buffers: shared hit=4283 read=96
                                                   ->  Merge Join  (cost=0.84..88.36 rows=164 width=312) (actual time=0.024..208.163 rows=1400 loops=1)
                                                         Merge Cond: (refeat.dbxref_id = dbxrefprop.dbxref_id)
                                                         Buffers: shared hit=78 read=93
                                                         ->  Merge Join  (cost=0.56..188.11 rows=1400 width=296) (actual time=0.016..165.490 rows=1400 loops=1)
                                                               Merge Cond: (refeat.dbxref_id = biomaterial.dbxref_id)
                                                               Buffers: shared hit=73 read=81
                                                               ->  Index Scan using feature_idx1 on feature refeat  (cost=0.29..884.42 rows=11936 width=264) (actual time=0.008..1.471 rows=1401 loops=1)
                                                                     Buffers: shared hit=68 read=66
                                                               ->  Index Scan using biomaterial_idx3 on biomaterial  (cost=0.28..63.28 rows=1400 width=32) (actual time=0.005..118.944 rows=1400 loops=1)
                                                                     Buffers: shared hit=5 read=15
                                                         ->  Index Scan using dbxrefprop_idx1 on dbxrefprop  (cost=0.28..60.28 rows=1400 width=16) (actual time=0.005..1.018 rows=1400 loops=1)
                                                               Buffers: shared hit=5 read=12
                                                   ->  Index Scan using dbxref_pkey on dbxref  (cost=0.29..0.36 rows=1 width=21) (actual time=0.057..0.058 rows=1 loops=1400)
                                                         Index Cond: (dbxref_id = refeat.dbxref_id)
                                                         Buffers: shared hit=4205 read=3
                                             ->  Nested Loop  (cost=0.86..1.55 rows=1 width=15) (actual time=0.119..0.848 rows=8 loops=1400)
                                                   Buffers: shared hit=39465 read=677
                                                   ->  Nested Loop  (cost=0.57..1.01 rows=1 width=16) (actual time=0.061..0.511 rows=8 loops=1400)
                                                         Buffers: shared hit=8314 read=162
                                                         ->  Index Scan using feature_idx1 on feature srcfeat  (cost=0.29..0.46 rows=1 width=16) (actual time=0.057..0.110 rows=1 loops=1400)
                                                               Index Cond: (dbxref.dbxref_id = dbxref_id)
                                                               Buffers: shared hit=4203 read=3
                                                         ->  Index Scan using featureloc_idx2 on featureloc  (cost=0.29..0.48 rows=8 width=16) (actual time=0.002..0.088 rows=8 loops=1400)
                                                               Index Cond: (srcfeature_id = srcfeat.feature_id)
                                                               Buffers: shared hit=4111 read=159
                                                   ->  Index Scan using feature_pkey on feature feat  (cost=0.29..0.53 rows=1 width=15) (actual time=0.021..0.028 rows=1 loops=10536)
                                                         Index Cond: (feature_id = featureloc.feature_id)
                                                         Buffers: shared hit=31151 read=515
                                       ->  Index Scan using biomaterialprop_c1 on biomaterialprop bp1  (cost=0.28..0.52 rows=1 width=12) (actual time=0.041..0.042 rows=1 loops=10606)
                                             Index Cond: ((biomaterial_id = biomaterial.biomaterial_id) AND (type_id = 2916))
                                             Buffers: shared hit=31906 read=13
                                 ->  Index Scan using biomaterialprop_c1 on biomaterialprop bp2  (cost=0.28..0.52 rows=1 width=12) (actual time=0.065..0.077 rows=1 loops=10606)
                                       Index Cond: ((biomaterial_id = bp1.biomaterial_id) AND (type_id = 2917))
                                       Buffers: shared hit=31923 read=17
                           ->  Index Scan using biomaterialprop_c1 on biomaterialprop bp3  (cost=0.28..0.52 rows=1 width=12) (actual time=0.042..0.050 rows=1 loops=10606)
                                 Index Cond: ((biomaterial_id = bp1.biomaterial_id) AND (type_id = 2918))
                                 Buffers: shared hit=31908 read=18
                     ->  Index Scan using biomaterialprop_c1 on biomaterialprop bp4  (cost=0.28..0.52 rows=1 width=12) (actual time=0.027..0.031 rows=1 loops=10606)
                           Index Cond: ((biomaterial_id = bp1.biomaterial_id) AND (type_id = 2919))
                           Buffers: shared hit=31877 read=25
               ->  Hash  (cost=5.36..5.36 rows=236 width=61) (actual time=0.254..0.254 rows=236 loops=1)
                     Buckets: 1024  Batches: 1  Memory Usage: 31kB
                     Buffers: shared hit=2 read=1
                     ->  Seq Scan on contact  (cost=0.00..5.36 rows=236 width=61) (actual time=0.003..0.129 rows=236 loops=1)
                           Buffers: shared hit=2 read=1
 Planning time: 160.551 ms
 Execution time: 10275.793 ms
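
One thing that stands out in this output is the gap between estimated and actual row counts: the Sort node is estimated at rows=98 but actually processes 10606 rows, and the final GroupAggregate is estimated at 98 rows but returns 1400. A sketch of refreshing the planner statistics and re-checking a plan follows. Whether stale statistics are actually the cause here is an assumption, and the reduced query at the end is only an illustration, not the full query:

```sql
-- Refresh planner statistics for every table used in the query.
ANALYZE featureloc;
ANALYZE feature;
ANALYZE dbxref;
ANALYZE dbxrefprop;
ANALYZE biomaterial;
ANALYZE biomaterialprop;
ANALYZE contact;

-- Re-run with timing and buffer statistics, shown here on a reduced
-- form of the query (one property join only) for illustration.
EXPLAIN (ANALYZE, BUFFERS)
SELECT biomaterial.description, bp1.value AS isolation_source
FROM biomaterial
INNER JOIN biomaterialprop bp1
        ON bp1.biomaterial_id = biomaterial.biomaterial_id
       AND bp1.type_id = 2916;
```

If the estimates are still far off after ANALYZE, the misestimate more likely comes from how the joins combine than from outdated statistics.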

The current table structure can be found here. I have not made any changes to the schema.

Interpreting EXPLAIN output and adding indexes to improve a query

0 Answers: