Question

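I want to do the following for a project.

I have 3 tables. The first two are the ones that matter here (the third is only included for context):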

author {id, name}
authorship {id, id1, id2}
paper {id, title}

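authorship links authors to papers: authorship.id1 refers to author.id and authorship.id2 refers to paper.id.

What I want to do is build a graph with one node per author, where the weight of the edge between two authors is determined by the number of papers they have in common: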

w=1 - union_of_common_papers/intersection_of_common_papers
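As written, the ratio divides the union by the intersection, which is at least 1 whenever two authors share a paper, so the weight would never be positive. If the intended weight is the Jaccard distance (an assumption, the question does not name it), the ratio is the other way around. Writing P(a) for the set of papers of author a (notation introduced here for illustration only):

```latex
w(a_1, a_2) = 1 - \frac{\lvert P(a_1) \cap P(a_2) \rvert}{\lvert P(a_1) \cup P(a_2) \rvert}
```

With this form, two authors with identical paper lists get w = 0, and the weight approaches 1 as the shared fraction shrinks.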

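So I built (with help from Stack Overflow) a SQL script that returns every pair of co-authors together with the counts for the union and the intersection of their papers. I will process the data in Java afterwards. This is the query: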

SELECT DISTINCT a1.name, a2.name, (
  SELECT  concat(count(a.id2), ',', count(DISTINCT a.id2)) 
  FROM authorship a 
  WHERE a.id1=a1.id or a.id1=a2.id) as weight
FROM authorship au1 
INNER JOIN authorship au2 ON au1.id2 = au2.id2 AND au1.id1 <> au2.id1 
INNER JOIN author a1 ON au1.id1 = a1.id 
INNER JOIN author a2 ON au2.id1 = a2.id;

This does the job and returns a list like this:

+-----------------+---------------------+---------+
| name            | name                | weight  |
+-----------------+---------------------+---------+
| Kurt            | Michael             | 161,157 |
| Kurt            | Miron               | 138,134 |
| Kurt            | Manish              | 19,18   |
| Roy             | Gregory             | 21,20   |
| Roy             | Richard             | 74,71   |
....

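The weight column holds two numbers a,b. Here a is count(a.id2), the number of authorship rows of the two authors together, and b is count(DISTINCT a.id2), the number of distinct papers, i.e. the union. Provided each (author, paper) pair is stored only once, a-b is the number of papers in common (the intersection).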
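A minimal sketch of how one weight cell could be decoded downstream. It is written in Python rather than the Java mentioned above, the function name `parse_weight` is hypothetical, and it assumes that each (author, paper) pair appears once in authorship and that the weight is meant as 1 - intersection/union:

```python
def parse_weight(cell):
    """Split an 'a,b' weight cell into (common, union, distance)."""
    # a = count(id2): all authorship rows of both authors
    # b = count(DISTINCT id2): distinct papers, i.e. the union
    total_rows, distinct_papers = (int(part) for part in cell.split(","))
    common = total_rows - distinct_papers  # papers counted twice are shared
    union = distinct_papers
    return common, union, 1 - common / union


common, union, w = parse_weight("161,157")
print(common, union)  # 4 157
```

For the Kurt/Michael row this gives 4 shared papers out of 157 distinct ones.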

But it takes a very long time. All of the overhead comes from this extra subselect:

  (SELECT  concat(count(a.id2), ',', count(DISTINCT a.id2)) 
  FROM authorship a 
  WHERE a.id1=a1.id or a.id1=a2.id) as weight

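Without the subselect, all records (1M+) are returned in under 2 minutes. With it, 50 records take more than 1.5 minutes.

I am using MySQL from the command line on Linux.

How can I optimize it?

author has ~130,000 records
authorship has ~1,300,000 records
The query should return ~1,200,000 records

This is what EXPLAIN returns for the query. I don't know how to make use of it.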

+----+--------------------+-------+--------+---------------------+-----------+---------+--------------+---------+-----------------+
| id | select_type        | table | type   | possible_keys       | key       | key_len | ref          | rows    | Extra           |
+----+--------------------+-------+--------+---------------------+-----------+---------+--------------+---------+-----------------+
|  1 | PRIMARY            | a1    | ALL    | PRIMARY             | NULL      | NULL    | NULL         |  124768 | Using temporary |
|  1 | PRIMARY            | au1   | ref    | NewIndex1,NewIndex2 | NewIndex1 | 5       | dblp.a1.ID   |       4 | Using where     |
|  1 | PRIMARY            | au2   | ref    | NewIndex1,NewIndex2 | NewIndex2 | 5       | dblp.au1.id2 |       1 | Using where     |
|  1 | PRIMARY            | a2    | eq_ref | PRIMARY             | PRIMARY   | 4       | dblp.au2.id1 |       1 |                 |
|  2 | DEPENDENT SUBQUERY | a     | ALL    | NewIndex1           | NULL      | NULL    | NULL         | 1268557 | Using where     |
+----+--------------------+-------+--------+---------------------+-----------+---------+--------------+---------+-----------------+

Answer 1

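You should be able to get the data directly from the joins in the outer query.

You can count the number of common papers by counting the id2 values that are the same for both authors.

You can compute the total number of papers as the sum of each author's distinct papers minus the number in common (because otherwise those would be counted twice):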

SELECT a1.name, a2.name,
       COUNT(distinct case when au1.id2 = au2.id2 then au1.id2 end) as CommonPapers,
       COUNT(distinct au1.id2) + COUNT(distinct au2.id2) - COUNT(distinct case when au1.id2 = au2.id2 then au1.id2 end) as TotalPapers
FROM authorship au1 INNER JOIN
     authorship au2
     ON au1.id2 = au2.id2 AND au1.id1 <> au2.id1 INNER JOIN
     author a1
     ON au1.id1 = a1.id INNER JOIN
     author a2
     ON au2.id1 = a2.id
group by a1.name, a2.name;

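In your data structure, id1 and id2 are poor names. Have you considered something like idauthor and idpaper?

Because of the initial inner join, the query above computes the intersection correctly but not the total: the join keeps only the papers the two authors share, so the per-author counts never see the others. One fix would be a full outer join, but MySQL does not support that. We can use additional subqueries instead: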

SELECT a1.name, a2.name,
       COUNT(distinct case when au1.id2 = au2.id2 then au1.id2 end) as CommonPapers,
       (ap1.NumPapers + ap2.NumPapers - COUNT(distinct case when au1.id2 = au2.id2 then au1.id2 end)
       ) as TotalPapers
FROM authorship au1 INNER JOIN
     authorship au2
     ON au1.id2 = au2.id2 AND au1.id1 <> au2.id1 INNER JOIN
     author a1
     ON au1.id1 = a1.id INNER JOIN
     author a2
     ON au2.id1 = a2.id inner join
     -- papers per author; assumes one authorship row per (author, paper) pair
     (select au.id1, count(*) as NumPapers
      from authorship au
      group by au.id1
     ) ap1
     on ap1.id1 = au1.id1 inner join
     (select au.id1, count(*) as NumPapers
      from authorship au
      group by au.id1
     ) ap2
     on ap2.id1 = au2.id1
-- grouping by id as well keeps authors who share a name apart
group by a1.id, a1.name, a2.id, a2.name, ap1.NumPapers, ap2.NumPapers;
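To check both approaches on something small, the sketch below loads an invented three-author dataset into an in-memory SQLite database and runs simplified versions of the two queries. SQLite standing in for MySQL, the sample rows, and the helper name `run` are all assumptions made for illustration. The per-author subqueries here carry a GROUP BY on id1, which they need in order to return one count per author.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE author (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE authorship (id INTEGER PRIMARY KEY, id1 INTEGER, id2 INTEGER);
INSERT INTO author VALUES (1, 'Kurt'), (2, 'Michael'), (3, 'Roy');
-- Invented data: Kurt wrote papers 10, 11, 12; Michael 10, 11, 13; Roy 12.
INSERT INTO authorship (id1, id2) VALUES
  (1, 10), (1, 11), (1, 12), (2, 10), (2, 11), (2, 13), (3, 12);
""")

# Self-join shared by both queries: one row per shared paper per ordered pair.
JOINS = """
FROM authorship au1
JOIN authorship au2 ON au1.id2 = au2.id2 AND au1.id1 <> au2.id1
JOIN author a1 ON au1.id1 = a1.id
JOIN author a2 ON au2.id1 = a2.id
"""

# First approach: every count only sees the shared papers kept by the join.
NAIVE_SQL = """
SELECT a1.name, a2.name,
       COUNT(DISTINCT au1.id2),
       COUNT(DISTINCT au1.id2) + COUNT(DISTINCT au2.id2) - COUNT(DISTINCT au1.id2)
""" + JOINS + "GROUP BY a1.id, a1.name, a2.id, a2.name"

# Second approach: per-author totals come from separate grouped subqueries.
FIXED_SQL = """
SELECT a1.name, a2.name,
       COUNT(DISTINCT au1.id2),
       ap1.numpapers + ap2.numpapers - COUNT(DISTINCT au1.id2)
""" + JOINS + """
JOIN (SELECT id1, COUNT(*) AS numpapers FROM authorship GROUP BY id1) ap1
  ON ap1.id1 = au1.id1
JOIN (SELECT id1, COUNT(*) AS numpapers FROM authorship GROUP BY id1) ap2
  ON ap2.id1 = au2.id1
GROUP BY a1.id, a1.name, a2.id, a2.name, ap1.numpapers, ap2.numpapers
"""


def run(sql):
    """Return {(name1, name2): (common, total)} for one of the queries."""
    return {(n1, n2): (common, total)
            for n1, n2, common, total in conn.execute(sql)}


naive = run(NAIVE_SQL)
fixed = run(FIXED_SQL)
print(naive[("Kurt", "Michael")])  # (2, 2)  total collapses to the intersection
print(fixed[("Kurt", "Michael")])  # (2, 4)  union of {10, 11, 12} and {10, 11, 13}
```

Each pair appears twice, once in each order, as in the original query. Adding a condition such as au1.id1 < au2.id1 to the join would halve the result if only unordered pairs are needed.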
