I want to do the following for a project.
I have 3 tables. Only the first two matter here (the third is included for context):
author {id, name}
authorship {id, id1, id2}
paper {id, title}
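The authorship table links authors to papers: authorship.id1 refers to author.id and authorship.id2 refers to paper.id.
What I want to do is build a graph with one node per author, where the weight of the edge between two authors is determined by the number of papers they have in common: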
w = 1 - intersection_of_common_papers/union_of_common_papers
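So I have built (with help from stackoverflow) an SQL script that returns every pair of co-authors together with the sizes of the union and the intersection of their papers. I will process the data in Java afterwards. This is the query: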
SELECT DISTINCT a1.name, a2.name, (
SELECT concat(count(a.id2), ',', count(DISTINCT a.id2))
FROM authorship a
WHERE a.id1=a1.id or a.id1=a2.id) as weight
FROM authorship au1
INNER JOIN authorship au2 ON au1.id2 = au2.id2 AND au1.id1 <> au2.id1
INNER JOIN author a1 ON au1.id1 = a1.id
INNER JOIN author a2 ON au2.id1 = a2.id;
This does the job and returns a list like the following:
+-----------------+---------------------+---------+
| name | name | weight |
+-----------------+---------------------+---------+
| Kurt | Michael | 161,157 |
| Kurt | Miron | 138,134 |
| Kurt | Manish | 19,18 |
| Roy | Gregory | 21,20 |
| Roy | Richard | 74,71 |
....
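The weight column holds two numbers a,b, where b is the size of the union of the two authors' papers and a-b is the size of the intersection (their common papers).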
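As a rough illustration of how the weight column could be turned into an edge weight afterwards, here is a minimal sketch. It assumes the intended weight is the Jaccard distance 1 - |A∩B|/|A∪B|, that the authorship table contains no duplicate (id1, id2) rows, and that the first number is count(a.id2) and the second is count(DISTINCT a.id2), as in the sub-select. The function name edge_weight is made up for this example.

```python
def edge_weight(weight_field: str) -> float:
    """Turn a 'rows,distinct' string such as '161,157' into a Jaccard distance."""
    rows, distinct = (int(part) for part in weight_field.split(","))
    union = distinct              # count(DISTINCT a.id2): papers by either author
    intersection = rows - union   # papers counted twice, i.e. written by both
    return 1.0 - intersection / union


print(edge_weight("161,157"))
```

The same arithmetic carries over directly to Java once the two numbers have been split on the comma.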
But this takes a very long time. All of the overhead comes from this extra sub-select:
(SELECT concat(count(a.id2), ',', count(DISTINCT a.id2))
FROM authorship a
WHERE a.id1=a1.id or a.id1=a2.id) as weight
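Without this sub-select, all records (1M+) are returned in under 2 minutes. With it, 50 records take more than 1.5 minutes.
I am using MySQL from the command line on Linux.
How can I optimize this?
This is what EXPLAIN returns for the query. I don't know how to make use of it.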
+----+--------------------+-------+--------+---------------------+-----------+---------+--------------+---------+-----------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+--------------------+-------+--------+---------------------+-----------+---------+--------------+---------+-----------------+
| 1 | PRIMARY | a1 | ALL | PRIMARY | NULL | NULL | NULL | 124768 | Using temporary |
| 1 | PRIMARY | au1 | ref | NewIndex1,NewIndex2 | NewIndex1 | 5 | dblp.a1.ID | 4 | Using where |
| 1 | PRIMARY | au2 | ref | NewIndex1,NewIndex2 | NewIndex2 | 5 | dblp.au1.id2 | 1 | Using where |
| 1 | PRIMARY | a2 | eq_ref | PRIMARY | PRIMARY | 4 | dblp.au2.id1 | 1 | |
| 2 | DEPENDENT SUBQUERY | a | ALL | NewIndex1 | NULL | NULL | NULL | 1268557 | Using where |
+----+--------------------+-------+--------+---------------------+-----------+---------+--------------+---------+-----------------+
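Answer 0: (score: 0)
You should be able to get the data directly from the joins in the outer query.
You can count the number of common papers by counting the id2 values that are the same for both authors.
You can compute the total number of papers as the sum of each author's distinct paper count minus the number in common (because otherwise the common papers would be counted twice):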
SELECT a1.name, a2.name,
COUNT(distinct case when au1.id2 = au2.id2 then au1.id2 end) as CommonPapers,
COUNT(distinct au1.id2) + COUNT(distinct au2.id2) - COUNT(distinct case when au1.id2 = au2.id2 then au1.id2 end) as TotalPapers
FROM authorship au1 INNER JOIN
authorship au2
ON au1.id2 = au2.id2 AND au1.id1 <> au2.id1 INNER JOIN
author a1
ON au1.id1 = a1.id INNER JOIN
author a2
ON au2.id1 = a2.id
group by a1.name, a2.name;
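The counting rule described above (total = papers of the first author + papers of the second author - common papers) can be checked on a toy example. This is only a sketch with made-up paper ids, not data from the question:

```python
# Hypothetical paper ids for two authors (toy data, not from the question).
papers_a = {1, 2, 3, 4}
papers_b = {3, 4, 5}

common = len(papers_a & papers_b)                       # papers written by both
total = len(papers_a) + len(papers_b) - common          # inclusion-exclusion

# The inclusion-exclusion total must equal the size of the set union.
assert total == len(papers_a | papers_b)
print(common, total)
```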
In your data structure, id1 and id2 are poor names. Have you considered something like idauthor and idpaper?
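Because of the initial inner join, the query above computes the intersection correctly but not the total: after the join, only papers shared by both authors remain. One fix would be a full outer join, but MySQL does not support that. We can use additional subqueries instead: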
SELECT a1.name, a2.name,
COUNT(distinct case when au1.id2 = au2.id2 then au1.id2 end) as CommonPapers,
(ap1.numpapers + ap2.numpapers - COUNT(distinct case when au1.id2 = au2.id2 then au1.id2 end)
) as TotalPapers
FROM authorship au1 INNER JOIN
authorship au2
ON au1.id2 = au2.id2 AND au1.id1 <> au2.id1 INNER JOIN
author a1
ON au1.id1 = a1.id INNER JOIN
author a2
ON au2.id1 = a2.id inner join
-- papers per author; the GROUP BY is required to get one row per author
(select au.id1, count(*) as numpapers
from authorship au
group by au.id1
) ap1
on ap1.id1 = au1.id1 inner join
(select au.id1, count(*) as numpapers
from authorship au
group by au.id1
) ap2
on ap2.id1 = au2.id1
-- group by the author ids so that two authors with the same name are not merged;
-- numpapers is constant per author and is listed to satisfy ONLY_FULL_GROUP_BY
group by a1.id, a2.id, a1.name, a2.name, ap1.numpapers, ap2.numpapers;
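To sanity-check the subquery approach on a small data set, here is a minimal sketch that runs it against an in-memory SQLite database through Python's sqlite3 module. SQLite stands in for MySQL purely for convenience, the rows are invented, and the query is written with a GROUP BY au.id1 inside each per-author subquery, which is what makes numpapers a per-author count. The redundant CASE expression is dropped because the join condition already guarantees au1.id2 = au2.id2.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE author (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE authorship (id INTEGER PRIMARY KEY, id1 INTEGER, id2 INTEGER);
    INSERT INTO author VALUES (1, 'Kurt'), (2, 'Michael'), (3, 'Roy');
    -- Toy data: Kurt wrote papers 10, 11, 12; Michael 11, 12, 13; Roy 12.
    INSERT INTO authorship (id1, id2) VALUES
        (1, 10), (1, 11), (1, 12),
        (2, 11), (2, 12), (2, 13),
        (3, 12);
""")

query = """
    SELECT a1.name, a2.name,
           COUNT(DISTINCT au1.id2) AS CommonPapers,
           ap1.numpapers + ap2.numpapers - COUNT(DISTINCT au1.id2) AS TotalPapers
    FROM authorship au1
    INNER JOIN authorship au2 ON au1.id2 = au2.id2 AND au1.id1 <> au2.id1
    INNER JOIN author a1 ON au1.id1 = a1.id
    INNER JOIN author a2 ON au2.id1 = a2.id
    INNER JOIN (SELECT id1, COUNT(*) AS numpapers FROM authorship GROUP BY id1) ap1
            ON ap1.id1 = au1.id1
    INNER JOIN (SELECT id1, COUNT(*) AS numpapers FROM authorship GROUP BY id1) ap2
            ON ap2.id1 = au2.id1
    GROUP BY a1.id, a2.id, a1.name, a2.name, ap1.numpapers, ap2.numpapers
    ORDER BY a1.name, a2.name
"""
rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
```

Each pair appears twice, once in each direction, because the join condition is au1.id1 <> au2.id1. Using au1.id1 < au2.id1 instead would return every pair only once.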