Optimizing a query that extracts multiple columns from the same table

Asked: 2011-05-09 14:20:18

Tags: mysql sql query-optimization

This is a follow-up to another question here on SO.

I have two database tables (more tables omitted):

acquisitions (acq)
    id {PK}
    id_cu {FK}
    datetime
    { Unique Constraint: id_cu - datetime }

data
    id {PK}
    id_acq {FK acquisitions}
    id_meas
    id_elab
    value

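Every possible id, as well as datetime, is indexed.

Of course, I am willing to change the database structure. I need to extract the data in this way:

  • rows grouped by datetime
  • one column of data.value for each selected acq.id_cu - data.id_meas - data.id_elab combination (see the note at the bottom of the post)
  • empty cells are allowed if data is missing for some columns but exists for other columns at that datetime

My current query is built this way (see the SO question):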

SELECT datetime, MAX(v1) AS v1, MAX(v2) AS v2, MAX(v3) AS v3 FROM (

SELECT acq.datetime AS datetime, data.value AS v1, NULL AS v2, NULL AS v3 
FROM acq INNER JOIN data ON acq.id = data.id_acq
WHERE acq.id_cu = 3 AND data.id_meas = 2 AND data.id_elab = 1

UNION

SELECT acq.datetime AS datetime, NULL AS v1, data.value AS v2, NULL AS v3 
FROM acq INNER JOIN data ON acq.id = data.id_acq
WHERE acq.id_cu = 5 AND data.id_meas = 4 AND data.id_elab = 6

UNION

SELECT acq.datetime AS datetime, NULL AS v1, NULL AS v2, data.value AS v3 
FROM acq INNER JOIN data ON acq.id = data.id_acq
WHERE acq.id_cu = 7 AND data.id_meas = 9 AND data.id_elab = 8

) AS T
WHERE datetime >= "2011-03-01 00:00:00" AND datetime <= "2011-04-30 23:59:59"
GROUP BY datetime

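Only 3 columns are retrieved here but, as I said, there are often more than 50 columns.

It works perfectly, but I wonder whether it can be optimized for speed.

This is the MySQL EXPLAIN EXTENDED for the query above: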

+----+--------------+--------------+------+------------------------------------------------+-----------------------+---------+------------------------+-------+----------+----------------------------------------------+
| id | select_type  | table        | type | possible_keys                                  | key                   | key_len | ref                    | rows  | filtered | Extra                                        |
+----+--------------+--------------+------+------------------------------------------------+-----------------------+---------+------------------------+-------+----------+----------------------------------------------+
|  1 | PRIMARY      | <derived2>   | ALL  | NULL                                           | NULL                  | NULL    | NULL                   | 82466 |   100.00 | Using where; Using temporary; Using filesort |
|  2 | DERIVED      | acquisitions | ref  | PRIMARY,id_cu,ix_acquisitions_id_cu            | id_cu                 | 4       |                        | 18011 |   100.00 |                                              |
|  2 | DERIVED      | data         | ref  | ix_data_id_meas,ix_data_id_acq,ix_data_id_elab | ix_data_id_acq        | 4       | sensor.acquisitions.id |     9 |   100.00 | Using where                                  |
|  3 | UNION        | acquisitions | ref  | PRIMARY,id_cu,ix_acquisitions_id_cu            | ix_acquisitions_id_cu | 4       |                        | 20864 |   100.00 |                                              |
|  3 | UNION        | data         | ref  | ix_data_id_meas,ix_data_id_acq,ix_data_id_elab | ix_data_id_acq        | 4       | sensor.acquisitions.id |     9 |   100.00 | Using where                                  |
|  4 | UNION        | acquisitions | ref  | PRIMARY,id_cu,ix_acquisitions_id_cu            | id_cu                 | 4       |                        | 31848 |   100.00 |                                              |
|  4 | UNION        | data         | ref  | ix_data_id_meas,ix_data_id_acq,ix_data_id_elab | ix_data_id_acq        | 4       | sensor.acquisitions.id |     9 |   100.00 | Using where                                  |
| NULL | UNION RESULT | <union2,3,4> | ALL  | NULL                                           | NULL                  | NULL    | NULL                   |  NULL |     NULL |                                              |
+----+--------------+--------------+------+------------------------------------------------+-----------------------+---------+------------------------+-------+----------+----------------------------------------------+
8 rows in set, 1 warning (8.24 sec)

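Currently, with (edit: checked today) 390k acquisitions and 9.2M data values (and growing), it takes about 10 minutes to extract a table of 59 columns. I know that the previous software took 1 hour to extract the data.

Thank you for your patience in reading this far :)


UPDATE

After Denis' answer, I tried his changes 1. and 2.; this is the resulting new query: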

SELECT datetime, MAX(v1) AS v1, MAX(v2) AS v2, MAX(v3) AS v3 FROM (

SELECT acq.datetime AS datetime, data.value AS v1, NULL AS v2, NULL AS v3 
FROM acq INNER JOIN data ON acq.id = data.id_acq
WHERE acq.id_cu = 3 AND data.id_meas = 2 AND data.id_elab = 1
AND datetime >= "2011-03-01 00:00:00" AND datetime <= "2011-04-30 23:59:59"

UNION ALL

SELECT acq.datetime AS datetime, NULL AS v1, data.value AS v2, NULL AS v3 
FROM acq INNER JOIN data ON acq.id = data.id_acq
WHERE acq.id_cu = 5 AND data.id_meas = 4 AND data.id_elab = 6
AND datetime >= "2011-03-01 00:00:00" AND datetime <= "2011-04-30 23:59:59"

UNION ALL

SELECT acq.datetime AS datetime, NULL AS v1, NULL AS v2, data.value AS v3 
FROM acq INNER JOIN data ON acq.id = data.id_acq
WHERE acq.id_cu = 7 AND data.id_meas = 9 AND data.id_elab = 8
AND datetime >= "2011-03-01 00:00:00" AND datetime <= "2011-04-30 23:59:59"

) AS T GROUP BY datetime

And here is the new EXPLAIN EXTENDED:

+----+--------------+--------------+-------+--------------------------------------------------------------+----------------+---------+------------------------+-------+----------+---------------------------------+
| id | select_type  | table        | type  | possible_keys                                                | key            | key_len | ref                    | rows  | filtered | Extra                           |
+----+--------------+--------------+-------+--------------------------------------------------------------+----------------+---------+------------------------+-------+----------+---------------------------------+
|  1 | PRIMARY      | <derived2>   | ALL   | NULL                                                         | NULL           | NULL    | NULL                   | 51997 |   100.00 | Using temporary; Using filesort |
|  2 | DERIVED      | acquisitions | range | PRIMARY,id_cu,ix_acquisitions_datetime,ix_acquisitions_id_cu | id_cu          | 12      | NULL                   | 14827 |   100.00 | Using where                     |
|  2 | DERIVED      | data         | ref   | ix_data_id_meas,ix_data_id_acq,ix_data_id_elab               | ix_data_id_acq | 4       | sensor.acquisitions.id |     9 |   100.00 | Using where                     |
|  3 | UNION        | acquisitions | range | PRIMARY,id_cu,ix_acquisitions_datetime,ix_acquisitions_id_cu | id_cu          | 12      | NULL                   | 18663 |   100.00 | Using where                     |
|  3 | UNION        | data         | ref   | ix_data_id_meas,ix_data_id_acq,ix_data_id_elab               | ix_data_id_acq | 4       | sensor.acquisitions.id |     9 |   100.00 | Using where                     |
|  4 | UNION        | acquisitions | range | PRIMARY,id_cu,ix_acquisitions_datetime,ix_acquisitions_id_cu | id_cu          | 12      | NULL                   | 13260 |   100.00 | Using where                     |
|  4 | UNION        | data         | ref   | ix_data_id_meas,ix_data_id_acq,ix_data_id_elab               | ix_data_id_acq | 4       | sensor.acquisitions.id |     9 |   100.00 | Using where                     |
| NULL | UNION RESULT | <union2,3,4> | ALL   | NULL                                                         | NULL           | NULL    | NULL                   |  NULL |     NULL |                                 |
+----+--------------+--------------+-------+--------------------------------------------------------------+----------------+---------+------------------------+-------+----------+---------------------------------+
8 rows in set, 1 warning (3.01 sec)

A good performance gain, without a doubt.


UPDATE (2)

This adds point 3.:

EXPLAIN EXTENDED SELECT datetime, MAX(v1) AS v1, MAX(v2) AS v2, MAX(v3) AS v3 FROM (

SELECT acquisitions.datetime AS datetime, MAX(data.value) AS v1, NULL AS v2, NULL AS v3 
FROM acquisitions INNER JOIN data ON acquisitions.id = data.id_acq
WHERE acquisitions.id_cu = 1 AND data.id_meas = 1 AND data.id_elab = 2
AND datetime >= "2011-03-01 00:00:00" AND datetime <= "2011-04-30 23:59:59"
GROUP BY datetime

UNION ALL

SELECT acquisitions.datetime AS datetime, NULL AS v1, MAX(data.value) AS v2, NULL AS v3 
FROM acquisitions INNER JOIN data ON acquisitions.id = data.id_acq
WHERE acquisitions.id_cu = 4 AND data.id_meas = 1 AND data.id_elab = 2
AND datetime >= "2011-03-01 00:00:00" AND datetime <= "2011-04-30 23:59:59"
GROUP BY datetime

UNION ALL

SELECT acquisitions.datetime AS datetime, NULL AS v1, NULL AS v2, MAX(data.value) AS v3 
FROM acquisitions INNER JOIN data ON acquisitions.id = data.id_acq
WHERE acquisitions.id_cu = 8 AND data.id_meas = 1 AND data.id_elab = 2
AND datetime >= "2011-03-01 00:00:00" AND datetime <= "2011-04-30 23:59:59"
GROUP BY datetime

) AS T GROUP BY datetime;

This is the result of EXPLAIN EXTENDED:

+----+--------------+--------------+-------+--------------------------------------------------------------+----------------+---------+------------------------+-------+----------+---------------------------------+
| id | select_type  | table        | type  | possible_keys                                                | key            | key_len | ref                    | rows  | filtered | Extra                           |
+----+--------------+--------------+-------+--------------------------------------------------------------+----------------+---------+------------------------+-------+----------+---------------------------------+
|  1 | PRIMARY      | <derived2>   | ALL   | NULL                                                         | NULL           | NULL    | NULL                   | 51997 |   100.00 | Using temporary; Using filesort |
|  2 | DERIVED      | acquisitions | range | PRIMARY,id_cu,ix_acquisitions_datetime,ix_acquisitions_id_cu | id_cu          | 12      | NULL                   | 14827 |   100.00 | Using where                     |
|  2 | DERIVED      | data         | ref   | ix_data_id_meas,ix_data_id_acq,ix_data_id_elab               | ix_data_id_acq | 4       | sensor.acquisitions.id |     9 |   100.00 | Using where                     |
|  3 | UNION        | acquisitions | range | PRIMARY,id_cu,ix_acquisitions_datetime,ix_acquisitions_id_cu | id_cu          | 12      | NULL                   | 18663 |   100.00 | Using where                     |
|  3 | UNION        | data         | ref   | ix_data_id_meas,ix_data_id_acq,ix_data_id_elab               | ix_data_id_acq | 4       | sensor.acquisitions.id |     9 |   100.00 | Using where                     |
|  4 | UNION        | acquisitions | range | PRIMARY,id_cu,ix_acquisitions_datetime,ix_acquisitions_id_cu | id_cu          | 12      | NULL                   | 13260 |   100.00 | Using where                     |
|  4 | UNION        | data         | ref   | ix_data_id_meas,ix_data_id_acq,ix_data_id_elab               | ix_data_id_acq | 4       | sensor.acquisitions.id |     9 |   100.00 | Using where                     |
| NULL | UNION RESULT | <union2,3,4> | ALL   | NULL                                                         | NULL           | NULL    | NULL                   |  NULL |     NULL |                                 |
+----+--------------+--------------+-------+--------------------------------------------------------------+----------------+---------+------------------------+-------+----------+---------------------------------+
8 rows in set, 1 warning (3.06 sec)

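Slightly slower; is this supposed to pay off with a large number of columns? I'll try it...


UPDATE (3)

I tried with and without MAX(data.value) ... GROUP BY datetime, and on a 60-column query I got better results with it. Results vary from run to run; this is one of them.

  • Original query: 9m12.144s
  • With Denis' 1. and 2.: 4m6.597s
  • With Denis' 1., 2. and 3.: 4m0.210s

The time required is reduced by about 57%.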
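
As a quick check of that figure, here is a small Python sketch that converts the three timings listed above to seconds and computes the relative reduction. The helper names are made up for illustration.

```python
def to_seconds(minutes, seconds):
    """Convert a 'XmY.YYYs' timing into seconds."""
    return minutes * 60 + seconds


def reduction_percent(before, after):
    """Relative time saved, as a percentage of the original time."""
    return (1 - after / before) * 100


original = to_seconds(9, 12.144)    # original query
with_1_2 = to_seconds(4, 6.597)     # Denis' points 1. and 2.
with_1_2_3 = to_seconds(4, 0.210)   # Denis' points 1., 2. and 3.

print(round(reduction_percent(original, with_1_2), 1))    # prints 55.3
print(round(reduction_percent(original, with_1_2_3), 1))  # prints 56.5
```

So both variants land at roughly 55-57% for this particular run.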


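UPDATE (4)

I tried Andiry's solution, but it is slower than Denis' optimizations.

Tested with 3 combinations / columns:

  • Unoptimized: 1m3s
  • Denis' optimizations: 1.7s
  • Andiry's CASE: 9.3s

I also tested with 12 combinations / columns:

  • Unoptimized: not tested
  • Denis' optimizations: 3.6s
  • Andiry's CASE: 13.7s

Moreover, Andiry's solution also includes acquisition datetimes for which none of the selected combinations has data, but for which other combinations do.

Imagine control unit 1 acquires data every 30 minutes at :00 and :30, while control unit 2 does so at :15 and :45: I would get twice the number of rows, with the extra rows filled with NULLs.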
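
The row-doubling effect can be reproduced on a toy dataset. The sketch below is only an illustration: it uses Python's built-in sqlite3 module instead of MySQL so that it is self-contained, and the four acquisitions and their values are made up. Only control unit 1 is selected in both queries.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE acquisitions (id INTEGER PRIMARY KEY, id_cu INTEGER, datetime TEXT);
CREATE TABLE data (id INTEGER PRIMARY KEY, id_acq INTEGER,
                   id_meas INTEGER, id_elab INTEGER, value REAL);
-- Made-up sample: unit 1 acquires at :00 and :30, unit 2 at :15 and :45.
INSERT INTO acquisitions VALUES
    (1, 1, '2011-03-01 00:00:00'), (2, 2, '2011-03-01 00:15:00'),
    (3, 1, '2011-03-01 00:30:00'), (4, 2, '2011-03-01 00:45:00');
INSERT INTO data VALUES
    (1, 1, 1, 2, 10.0), (2, 2, 1, 2, 20.0),
    (3, 3, 1, 2, 11.0), (4, 4, 1, 2, 21.0);
""")

# CASE pivot: unit 2's datetimes are not filtered out, so they
# come back as rows whose only value column is NULL.
case_rows = con.execute("""
    SELECT a.datetime,
           MAX(CASE WHEN a.id_cu = 1 AND d.id_meas = 1 AND d.id_elab = 2
                    THEN d.value END) AS v1
    FROM acquisitions a INNER JOIN data d ON a.id = d.id_acq
    GROUP BY a.datetime ORDER BY a.datetime
""").fetchall()

# One filtered branch of the UNION ALL form: only unit 1's rows are fetched.
union_rows = con.execute("""
    SELECT a.datetime, MAX(d.value) AS v1
    FROM acquisitions a INNER JOIN data d ON a.id = d.id_acq
    WHERE a.id_cu = 1 AND d.id_meas = 1 AND d.id_elab = 2
    GROUP BY a.datetime ORDER BY a.datetime
""").fetchall()

print(len(case_rows), len(union_rows))  # prints: 4 2
```

The CASE form returns the :15 and :45 rows with a NULL value, while the filtered form returns only the :00 and :30 rows.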


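NOTE:

This is all about a sensor system: there are several control units (one for each id_cu), each with its own sensors.

A single sensor is identified by an id_cu / id_meas pair and sends a different elaboration for each measure, e.g. MIN (id_elab=1), MAX (id_elab=2), AVERAGE (id_elab=3), INSTANT (id_elab=...) and so on, one for each id_elab.

The user is free to select as many elaborations as he wants, for example:

  • a result column with the AVERAGE value (3) of sensor #3 of control unit #1: id_cu=1 / id_meas=3 / id_elab=3
  • a result column with the AVERAGE value (3) of sensor #5 of control unit #1: id_cu=1 / id_meas=5 / id_elab=3
  • another column with the MIN value (1) of sensor #2 of control unit #4: id_cu=4 / id_meas=2 / id_elab=1
  • (put any valid id_cu, id_meas, id_elab combination here)
  • ...

and so on, up to dozens of choices...

This is the partial DDL (irrelevant tables not included):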

CREATE TABLE acquisitions (
    id INTEGER NOT NULL AUTO_INCREMENT, 
    id_cu INTEGER NOT NULL, 
    datetime DATETIME NOT NULL, 
    PRIMARY KEY (id), 
    UNIQUE (id_cu, datetime), 
    FOREIGN KEY(id_cu) REFERENCES ctrl_units (id) ON DELETE CASCADE
)

CREATE TABLE data (
    id INTEGER NOT NULL AUTO_INCREMENT, 
    id_acq INTEGER NOT NULL, 
    id_meas INTEGER NOT NULL, 
    id_elab INTEGER NOT NULL, 
    value FLOAT, 
    PRIMARY KEY (id), 
    FOREIGN KEY(id_acq) REFERENCES acquisitions (id) ON DELETE CASCADE
)

CREATE TABLE ctrl_units (
    id INTEGER NOT NULL, 
    name VARCHAR(40) NOT NULL, 
    PRIMARY KEY (id)
)

CREATE TABLE sensors (
    id_cu INTEGER NOT NULL, 
    id_meas INTEGER NOT NULL, 
    id_elab INTEGER NOT NULL, 
    name VARCHAR(40) NOT NULL, 
    `desc` VARCHAR(80), 
    PRIMARY KEY (id_cu, id_meas), 
    FOREIGN KEY(id_cu) REFERENCES ctrl_units (id) ON DELETE CASCADE
)

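3 Answers:

Answer 0 (score: 3)

There are three main issues:

  1. Use union all, not union. You are grouping and extracting min/max values anyway, so there is no point in introducing a sort step to remove duplicate rows.

  2. The where clause can be placed inside each union sub-statement: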

    select ...
    from (
    select ... from ...  where ...
    union all
    select ... from ...  where ...
    union all
    ...
    ) as t
    group by ...
    

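    The way you wrote it, all rows are fetched first, then appended together, and only at the end filtered down to the rows you need. Injecting the where clause into the union sub-statements makes each one fetch only the needed rows before they are appended.

  3. Likewise, pre-aggregate the aggregates: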

    select ..., max(foo) as foo
    from (
    select ..., max(foo) as foo from ...  where ... group by ...
    union all
    select ..., max(foo) as foo from ...  where ... group by ...
    union all
    ...
    ) as t
    group by ...
    

    The optimizer will make better use of the existing indexes, and you end up appending only a few rows rather than millions.
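
With dozens of columns, writing the branches by hand gets tedious, so the three points above could be applied by generating the SQL. What follows is a minimal Python sketch, assuming the acquisitions / data schema from the question. The function build_pivot_query and its placeholder argument are hypothetical names, and the placeholder string has to match the parameter style of whatever DB driver is used.

```python
def build_pivot_query(combos, start, end, placeholder="%s"):
    """Return (sql, params) with one pre-aggregated UNION ALL branch per combination.

    combos: list of (id_cu, id_meas, id_elab) tuples, one per output column.
    placeholder: parameter marker of the DB driver in use (an assumption here).
    """
    if not combos:
        raise ValueError("at least one combination is required")
    p = placeholder
    n = len(combos)
    branches, params = [], []
    for i, (id_cu, id_meas, id_elab) in enumerate(combos):
        # Branch i fills column v(i+1); every other column is NULL.
        cols = ", ".join(
            ("MAX(data.value)" if j == i else "NULL") + f" AS v{j + 1}"
            for j in range(n)
        )
        branches.append(
            f"SELECT acquisitions.datetime AS datetime, {cols} "
            "FROM acquisitions INNER JOIN data ON acquisitions.id = data.id_acq "
            f"WHERE acquisitions.id_cu = {p} AND data.id_meas = {p} "
            f"AND data.id_elab = {p} "
            f"AND acquisitions.datetime >= {p} AND acquisitions.datetime <= {p} "
            "GROUP BY acquisitions.datetime"
        )
        params += [id_cu, id_meas, id_elab, start, end]
    # Outer query collapses the branches into one row per datetime.
    outer = ", ".join(f"MAX(v{j + 1}) AS v{j + 1}" for j in range(n))
    sql = (
        f"SELECT datetime, {outer} FROM ("
        + " UNION ALL ".join(branches)
        + ") AS T GROUP BY datetime"
    )
    return sql, params


# Example: the three combinations from the question.
sql, params = build_pivot_query(
    [(3, 2, 1), (5, 4, 6), (7, 9, 8)],
    "2011-03-01 00:00:00", "2011-04-30 23:59:59",
)
```

The values are passed as parameters rather than formatted into the string, so only the column aliases are built by string concatenation.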

Answer 1 (score: 1)

SELECT
  acq.datetime,
  MAX(CASE WHEN acq.id_cu = 2 AND data.id_meas = 2 AND data.id_elab = 1 THEN data.value END) AS v1,
  MAX(CASE WHEN acq.id_cu = 5 AND data.id_meas = 4 AND data.id_elab = 6 THEN data.value END) AS v2,
  MAX(CASE WHEN acq.id_cu = 7 AND data.id_meas = 9 AND data.id_elab = 8 THEN data.value END) AS v3
FROM acq
  INNER JOIN data ON acq.id = data.id_acq
WHERE datetime >= "2011-03-01 00:00:00" AND datetime <= "2011-04-30 23:59:59"
GROUP BY acq.datetime

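This may look roughly equivalent to the original query, but the main difference is that, logically, it scans the tables only once, instead of three or more times as with the UNIONs.

Answer 2 (score: 0)

Basically, I think you would get better results by using a single SELECT and handling the conditions with CASE. In any case, you may want to benchmark and compare...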

SELECT acq.datetime AS datetime, 
       MAX(
           CASE acq.id_cu
           WHEN 1 THEN data.value
           END 
       ) as v1,
       MAX(
           CASE acq.id_cu
           WHEN 4 THEN data.value
           END 
       ) as v2,
       MAX(
           CASE acq.id_cu
           WHEN 8 THEN data.value
           END 
       ) as v3
FROM 
       acq INNER JOIN data ON acq.id = data.id_acq
WHERE 
       data.id_meas = 1 AND data.id_elab = 2 AND
       datetime BETWEEN "2011-03-01 00:00:00" AND "2011-04-30 23:59:59"
GROUP BY
       acq.datetime

This should do a clean range scan. Also, a composite index could do even more.
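
As one possible reading of the composite index remark: the EXPLAIN output in the question shows data being reached through ix_data_id_acq with an additional "Using where", so id_meas and id_elab are filtered only after the index lookup. A composite index covering all three columns might let the lookup resolve them together. The sketch below uses Python's sqlite3 module only so that it runs standalone; the index name ix_data_acq_meas_elab and the column order are assumptions, and whether MySQL actually picks the index would need to be checked with EXPLAIN on the real data.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE data (
    id INTEGER PRIMARY KEY,
    id_acq INTEGER NOT NULL,
    id_meas INTEGER NOT NULL,
    id_elab INTEGER NOT NULL,
    value FLOAT
);
-- Hypothetical composite index: join column first, then the two filter columns.
CREATE INDEX ix_data_acq_meas_elab ON data (id_acq, id_meas, id_elab);
""")

# List the indexed columns in order, to confirm what was created.
index_columns = [
    row[2] for row in con.execute("PRAGMA index_info(ix_data_acq_meas_elab)")
]
print(index_columns)  # prints: ['id_acq', 'id_meas', 'id_elab']
```

The CREATE INDEX statement itself is also valid MySQL syntax.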

Finally, what is wrong with using GROUP BY, for example:

SELECT data.id_meas, acq.datetime AS datetime, MAX(data.value)
FROM 
       acq INNER JOIN data ON acq.id = data.id_acq
WHERE 
       data.id_elab = 2 AND
       datetime BETWEEN "2011-03-01 00:00:00" AND "2011-04-30 23:59:59" AND
       data.id_meas IN (1,4,8)
GROUP BY
       data.id_meas

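This is the simplest form (and the most flexible), even though it does not transpose the rows into columns for you (for the different values of data.id_meas). However, it will give you the best idea of what performance to expect and which indexes are most useful for the query.

EDIT: To get the max data value for each *acq.id_cu - data.id_meas - data.id_elab combination* you should be able to use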

SELECT 
       acq.id_cu, data.id_meas, data.id_elab, acq.datetime AS datetime, MAX(data.value)
FROM 
       acq INNER JOIN data ON acq.id = data.id_acq
WHERE 
       data.id_elab = 2 AND
       datetime BETWEEN "2011-03-01 00:00:00" AND "2011-04-30 23:59:59" AND
       data.id_meas IN (1,4,8)
GROUP BY
       acq.id_cu, data.id_meas, data.id_elab, acq.datetime

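This will give you max(data.value) for all combinations of acq.id_cu, data.id_meas, data.id_elab, acq.datetime (after filtering with the values in the WHERE clause; adjust the WHERE clause to change which rows are included). It will not show NULLs for combinations that have no rows, but there is a workaround if this is the right direction for you. GROUP BY also determines the sort order, so change the order of the columns in the GROUP BY to change the ordering.

If my answer still misses the point, some sample data / test cases would be useful.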

The confusing part of your example is where you say

"one column of data.value for each selected acq.id_cu - data.id_meas - data.id_elab combination."

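But when you select the data in your example query, you select the values directly as columns, grouped only by datetime. So if it really is about combinations of values, there is no way to tell which row corresponds to which combination (there may be multiple rows for a given datetime). If, instead, it is not the combination of all the values you are filtering/grouping by, then the grouping condition that determines the max value depends directly on datetime.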