Self-join between subqueries, for duplicate detection

Time: 2017-04-24 09:19:18

Tags: mysql join mariadb

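I have a query that performs a self-join on a subquery. It takes longer than I expected, and I am having trouble understanding why.

The problem is as follows: owners can have items, but some items may appear twice, belonging to different owners. For each owner we may get slightly different information about the item, or some fields may be empty.

This is a simplified version of my database. It contains no FKs, and indexes exist only on IdOwner, IdItem and IdCategory wherever they appear.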

Owner:

+----------------+---------------+------+-----+
| Field          | Type          | Null | Key | 
+----------------+---------------+------+-----+
| IdOwner        | bigint(20)    | NO   | PRI |
| IdPlace        | int(10)       | NO   |     |
| SomeDate       | datetime      | YES  |     |
+----------------+---------------+------+-----+

Items:

+----------------+---------------+------+-----+
| Field          | Type          | Null | Key | 
+----------------+---------------+------+-----+
| IdItem         | bigint(20)    | NO   | PRI |
| IdOwner        | bigint(20)    | NO   | MUL |
| IdCategory     | int(10)       | NO   |     |
| DupValue1      | varchar()     | YES  |     |
      .
      .
      .
| DupValueN      | varchar()     | YES  |     |
+----------------+---------------+------+-----+

Country:

+----------------+---------------+------+-----+
| Field          | Type          | Null | Key | 
+----------------+---------------+------+-----+
| IdOwner        | bigint(20)    | NO   | PRI |
| Country        | Varchar()     | NO   | PRI |
+----------------+---------------+------+-----+

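DupValue1 through DupValueN are the columns I found most likely to be identical when an item is duplicated.

This is a simplified version of the query I am currently using: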

SELECT subquery1.IdItem, subquery2.IdItem FROM 
(SELECT i1.IdCategory, i1.IdOwner, i1.IdItem, i1.DupValue1, o1.IdSite, o1.SomeDate, COUNTRY.country
FROM ITEMS i1 
LEFT JOIN OWNER o1 ON o1.IdOwner=i1.IdOwner 
LEFT JOIN COUNTRY ON i1.IdOwner=COUNTRY.IdOwner
WHERE i1.IdOwner>9000000) 
as subquery1
INNER JOIN 
(SELECT i2.IdCategory, i2.IdOwner, i2.IdItem, i2.DupValue1, o2.IdSite, o2.SomeDate, COUNTRY.country
FROM ITEMS i2 
LEFT JOIN COUNTRY COUNTRY ON i2.IdOwner=COUNTRY.IdOwner
LEFT JOIN OWNER o2 ON o2.IdOwner=i2.IdOwner 
WHERE i2.IdOwner>9000000) 
as subquery2
ON subquery1.IdItem<subquery2.IdItem 
AND subquery1.IdCategory=subquery2.IdCategory 
AND subquery1.IdSite!=subquery2.IdSite AND subquery1.country=subquery2.country 
AND DATE(subquery1.SomeDate)=DATE(subquery2.SomeDate) 
AND (subquery1.DupValue1=subquery2.DupValue1 OR subquery1.DupValue1 IS NULL OR subquery2.DupValue1 IS NULL) 

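There are a few more DupValue conditions of the same form.

The WHERE clause is there to limit the number of owners, since I am still testing the query. With the WHERE clause in place it restricts the owners to ~770k rows, and with that many rows the query takes about 30 minutes to run.

When I run EXPLAIN on the query, I get this: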

+------+-------------+---------+--------+----------------------------------------+-------------+---------+------------------------+-------+------------------------------------+
| id   | select_type | table   | type   | possible_keys                          | key         | key_len | ref                    | rows  | Extra                              |
+------+-------------+---------+--------+----------------------------------------+-------------+---------+------------------------+-------+------------------------------------+
|    1 | SIMPLE      | i1      | range  | PRIMARY,UnivocID,dg_owner,dg_category  | UnivocID    | 8       | NULL                   | 19056 | Using index condition              |
|    1 | SIMPLE      | o1      | eq_ref | PRIMARY                                | PRIMARY     | 8       | i1.IdTender            |     1 |                                    |
|    1 | SIMPLE      | country | ref    | PRIMARY                                | PRIMARY     | 8       | i1.IdTender            |     1 | Using index                        |
|    1 | SIMPLE      | i2      | ref    | PRIMARY,UnivocID,dg_owner,dg_category  | dg_category | 4       | i1.IdMolecule          |   657 | Using index condition; Using where |
|    1 | SIMPLE      | o2      | eq_ref | PRIMARY                                | PRIMARY     | 8       | i2.IdTender            |     1 | Using where                        |
|    1 | SIMPLE      | country | ref    | PRIMARY                                | PRIMARY     | 8       | i2.IdTender            |     1 | Using index                        |
+------+-------------+---------+--------+----------------------------------------+-------------+---------+------------------------+-------+------------------------------------+

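MariaDB version: 10.1

My two questions:

Is subquery1 executed for every row of subquery2, and is that what causes the long execution time, or is the nature of the ON clause at fault?

Can the query be improved, perhaps by dropping the JOIN or using some other operator?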

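2 Answers:

Answer 0: (score: 0)

I can't test this, as I have no table layouts or test data (also, your join clause refers to a column called SubmissionDate, but that field is not returned by the subqueries), but the following should avoid the subqueries. Hopefully it will make better use of the indexes: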

SELECT i1.IdItem, i2.IdItem 
FROM ITEMS i1 
INNER JOIN ITEMS i2 
ON i1.IdItem < i2.IdItem AND i1.IdCategory = i2.IdCategory AND (i1.DupValue1 = i2.DupValue1 OR i1.DupValue1 IS NULL OR i2.DupValue1 IS NULL)
INNER JOIN OWNER o1 ON o1.IdOwner = i1.IdOwner 
INNER JOIN COUNTRY c1 ON i1.IdOwner = c1.IdOwner
INNER JOIN OWNER o2 ON o2.IdOwner = i2.IdOwner AND o1.IdSite != o2.IdSite
INNER JOIN COUNTRY c2 ON i2.IdOwner = c2.IdOwner AND c1.country = c2.country 
WHERE i1.IdOwner > 9000000
AND i2.IdOwner > 9000000
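
To make the flattened join concrete, here is a minimal sketch that runs the same join shape (ITEMS joined to itself, with OWNER and COUNTRY joined once per side) against a tiny in-memory database. It uses Python's sqlite3 module purely as a stand-in for MariaDB, which is an assumption made so the example is self-contained. The sample rows and the helper names build_sample_db and find_duplicate_pairs are invented for illustration, and the owner filter and the date comparison are left out.

```python
import sqlite3


def build_sample_db():
    """Create a small in-memory database with hypothetical sample rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE OWNER (IdOwner INTEGER PRIMARY KEY, IdSite INTEGER);
        CREATE TABLE ITEMS (IdItem INTEGER PRIMARY KEY, IdOwner INTEGER,
                            IdCategory INTEGER, DupValue1 TEXT);
        CREATE TABLE COUNTRY (IdOwner INTEGER, Country TEXT,
                              PRIMARY KEY (IdOwner, Country));
        INSERT INTO OWNER VALUES (1, 10), (2, 20), (3, 10);
        INSERT INTO COUNTRY VALUES (1, 'ES'), (2, 'ES'), (3, 'ES');
        INSERT INTO ITEMS VALUES
            (100, 1, 5, 'abc'),
            (101, 2, 5, 'abc'),
            (102, 2, 5, NULL),
            (103, 3, 5, 'abc'),
            (104, 2, 6, 'abc');
    """)
    return conn


def find_duplicate_pairs(conn):
    """Return (IdItem, IdItem) pairs that look like duplicates across sites."""
    sql = """
        SELECT i1.IdItem, i2.IdItem
        FROM ITEMS i1
        JOIN ITEMS i2
          ON i1.IdItem < i2.IdItem            -- report each pair only once
         AND i1.IdCategory = i2.IdCategory
         AND (i1.DupValue1 = i2.DupValue1     -- a NULL on either side matches
              OR i1.DupValue1 IS NULL OR i2.DupValue1 IS NULL)
        JOIN OWNER o1 ON o1.IdOwner = i1.IdOwner
        JOIN OWNER o2 ON o2.IdOwner = i2.IdOwner AND o1.IdSite != o2.IdSite
        JOIN COUNTRY c1 ON c1.IdOwner = i1.IdOwner
        JOIN COUNTRY c2 ON c2.IdOwner = i2.IdOwner AND c1.Country = c2.Country
        ORDER BY i1.IdItem, i2.IdItem
    """
    return conn.execute(sql).fetchall()


if __name__ == "__main__":
    for pair in find_duplicate_pairs(build_sample_db()):
        print(pair)
```

Items 100 and 103 are not paired because their owners share a site, and item 104 is in a different category. SQLite's planner differs from MariaDB's, so this only checks the logic of the join, not its performance.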

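Answer 1: (score: 0)

Another way to investigate or deal with duplicates is to gather the rows from each table together into a single table, then do a GROUP BY on that table: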

CREATE TEMPORARY TABLE t
    ( SELECT stuff from one table or set of tables )
    UNION ALL
    ( SELECT stuff from the other table or tables )
;
SELECT * FROM t
    GROUP BY IdOwner, IdSite, country
;

If needed, you can add an extra column to the "stuff" to distinguish the source:

SELECT 1 AS source, ...
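
As a rough illustration of the collect-then-GROUP BY idea, the sketch below gathers candidate rows into a temporary table and groups them on the columns expected to match. It runs on Python's sqlite3 module as a stand-in for MariaDB (an assumption for the sake of a self-contained example). The grouping columns, the HAVING filter, the sample rows and the helper names are all assumptions added here, not part of the answer, and unlike the original ON clause this version does not treat a NULL DupValue1 as matching anything.

```python
import sqlite3


def build_sample_db():
    """Create a small in-memory database with hypothetical sample rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE ITEMS (IdItem INTEGER PRIMARY KEY, IdOwner INTEGER,
                            IdCategory INTEGER, DupValue1 TEXT);
        CREATE TABLE COUNTRY (IdOwner INTEGER, Country TEXT,
                              PRIMARY KEY (IdOwner, Country));
        INSERT INTO COUNTRY VALUES (1, 'ES'), (2, 'ES'), (3, 'FR');
        INSERT INTO ITEMS VALUES
            (100, 1, 5, 'abc'),
            (101, 2, 5, 'abc'),
            (102, 2, 5, NULL),
            (103, 3, 5, 'xyz'),
            (104, 3, 6, 'abc');
    """)
    return conn


def duplicate_groups(conn):
    """Collect rows into one temp table, then GROUP BY to find repeats."""
    conn.executescript("""
        DROP TABLE IF EXISTS temp.t;
        CREATE TEMPORARY TABLE t AS
            SELECT i.IdItem, i.IdOwner, i.IdCategory, i.DupValue1, c.Country
            FROM ITEMS i
            JOIN COUNTRY c ON c.IdOwner = i.IdOwner;
    """)
    sql = """
        SELECT IdCategory, Country, DupValue1, COUNT(*) AS n
        FROM t
        WHERE DupValue1 IS NOT NULL          -- NULLs are ignored in this sketch
        GROUP BY IdCategory, Country, DupValue1
        HAVING COUNT(DISTINCT IdOwner) > 1   -- same values under several owners
        ORDER BY IdCategory, Country, DupValue1
    """
    return conn.execute(sql).fetchall()


if __name__ == "__main__":
    for group in duplicate_groups(build_sample_db()):
        print(group)
```

Grouping visits each row once instead of comparing every pair of rows, which is the appeal of this approach; the trade-off is that "equal or NULL" style matching has to be handled separately.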

The reason for the poor performance:

FROM ( subquery1 )
JOIN ( subquery2 ) ON ...

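has no index with which to perform the ON (until at least 5.6). So the subquery results are scanned in full. Even with 5.6, creating the index has some overhead.

Another tip, re: AND DATE(subquery1.SomeDate)=DATE(subquery2.SomeDate): compute DATE(SomeDate) while building the subquery; that makes it a one-time computation rather than one repeated on every pass over the subquery result.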
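
The DATE() tip can be sketched as follows: each subquery exposes a precomputed SomeDay column, and the ON clause compares those columns directly instead of calling DATE() on both sides for every candidate pair. The example uses Python's sqlite3 module as a stand-in for MariaDB (an assumption); the column alias SomeDay, the sample rows and the helper names are hypothetical, and only the OWNER table is used to keep it short.

```python
import sqlite3


def build_sample_db():
    """Create a small in-memory OWNER table with hypothetical sample rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE OWNER (IdOwner INTEGER PRIMARY KEY, IdSite INTEGER,
                            SomeDate TEXT);
        INSERT INTO OWNER VALUES
            (1, 10, '2017-01-01 08:00:00'),
            (2, 20, '2017-01-01 17:30:00'),
            (3, 30, '2017-01-02 09:00:00');
    """)
    return conn


def same_day_owner_pairs(conn):
    """Pair owners whose SomeDate falls on the same calendar day."""
    sql = """
        SELECT s1.IdOwner, s2.IdOwner
        FROM (SELECT IdOwner, DATE(SomeDate) AS SomeDay  -- computed once per row
              FROM OWNER) AS s1
        JOIN (SELECT IdOwner, DATE(SomeDate) AS SomeDay
              FROM OWNER) AS s2
          ON s1.IdOwner < s2.IdOwner
         AND s1.SomeDay = s2.SomeDay                     -- plain column comparison
        ORDER BY s1.IdOwner, s2.IdOwner
    """
    return conn.execute(sql).fetchall()


if __name__ == "__main__":
    print(same_day_owner_pairs(build_sample_db()))
```

SQLite may flatten these subqueries during planning, so the sketch shows the shape of the rewrite rather than its effect on MariaDB's execution time.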