SQL - Jaccard similarity

Time: 2016-04-18 22:09:33

Tags: sql google-bigquery

My table looks like this:

author | group 

daniel | group1,group2,group3,group4,group5,group8,group10
adam   | group2,group5,group11,group12
harry  | group1,group10,group15,group13,group15,group18
...
...

I would like my output to look like this:

author1 | author2 | intersection | union

daniel | adam | 2 | 9
daniel | harry| 2 | 11
adam   | harry| 0 | 10

Thank you

3 answers:

Answer 0: (score: 1)

Try the following (for BigQuery; this version uses Legacy SQL syntax)

SELECT
  a.author AS author1, 
  b.author AS author2, 
  SUM(a.item=b.item) AS intersection, 
  EXACT_COUNT_DISTINCT(a.item) + EXACT_COUNT_DISTINCT(b.item) - intersection AS [union]
FROM FLATTEN((
  SELECT author, SPLIT([group]) AS item FROM YourTable
), item) AS a
CROSS JOIN FLATTEN((
  SELECT author, SPLIT([group]) AS item FROM YourTable
), item) AS b
WHERE a.author < b.author 
GROUP BY 1,2
  

Added a solution for BigQuery Standard SQL

WITH YourTable AS (
  SELECT 'daniel' AS author, 'group1,group2,group3,group4,group5,group8,group10' AS grp UNION ALL
  SELECT 'adam' AS author, 'group2,group5,group11,group12' AS grp UNION ALL
  SELECT 'harry' AS author, 'group1,group10,group13,group15,group18' AS grp
),
tempTable AS (
  SELECT author, SPLIT(grp) AS grp
  FROM YourTable
)
SELECT 
  a.author AS author1, 
  b.author  AS author2,
  (SELECT COUNT(1) FROM a.grp) AS count1,
  (SELECT COUNT(1) FROM b.grp) AS count2,
  (SELECT COUNT(1) FROM UNNEST(a.grp) AS agrp JOIN UNNEST(b.grp) AS bgrp ON agrp = bgrp) AS intersection_count,
  (SELECT COUNT(1) FROM (SELECT * FROM UNNEST(a.grp) UNION DISTINCT SELECT * FROM UNNEST(b.grp))) AS union_count
FROM tempTable a
JOIN tempTable b
ON a.author < b.author

I like this one better:

  • Simpler / friendlier code
  • No CROSS JOIN and no extra GROUP BY needed

When/if you try it, be sure to uncheck the Use Legacy SQL checkbox under Show Options.
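As a cross-check on what the Standard SQL query above should return for its inline sample data, here is a minimal Python sketch. It is not part of the answer: Python is used only for illustration, and pair_counts is a hypothetical helper name. Sets stand in for the arrays, so each group is counted once per author.

```python
from itertools import combinations

# Sample rows copied from the WITH clause of the Standard SQL query.
rows = {
    "daniel": "group1,group2,group3,group4,group5,group8,group10",
    "adam": "group2,group5,group11,group12",
    "harry": "group1,group10,group13,group15,group18",
}

# Rough equivalent of SPLIT(grp); sets collapse any duplicate groups.
groups = {author: set(grp.split(",")) for author, grp in rows.items()}


def pair_counts(groups):
    """Return {(author1, author2): (intersection, union)} with author1 < author2."""
    result = {}
    for a, b in combinations(sorted(groups), 2):
        result[(a, b)] = (len(groups[a] & groups[b]), len(groups[a] | groups[b]))
    return result


counts = pair_counts(groups)
for (a, b), (inter, union) in counts.items():
    print(a, b, inter, union)
```

Because the join condition is a.author < b.author, each pair appears once, with the alphabetically smaller author first.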

Answer 1: (score: 0)

Inspired by Mikhail Berlyant's second answer, here is essentially the same approach reformatted for Presto (as an example of another SQL flavor). Again, all credit goes to Mikhail.

WITH YourTable AS (
  SELECT 'daniel' AS author, 'group1,group2,group3,group4,group5,group8,group10' AS grp UNION ALL
  SELECT 'adam' AS author, 'group2,group5,group11,group12' AS grp UNION ALL
  SELECT 'harry' AS author, 'group1,group10,group13,group15,group18' AS grp
),
tempTable AS (
  SELECT author, SPLIT(grp, ',') AS grp
  FROM YourTable
)
SELECT 
  a.author AS author1, 
  b.author AS author2,
  CARDINALITY(a.grp) AS count1,
  CARDINALITY(b.grp) AS count2,
  CARDINALITY(ARRAY_INTERSECT(a.grp, b.grp)) AS intersection_count,
  CARDINALITY(ARRAY_UNION(a.grp, b.grp)) AS union_count
FROM tempTable a
JOIN tempTable b
ON a.author < b.author
;

Note that the union_count values will be slightly different, because it counts only unique entries. For example, harry has two group15 values, but only one will be counted.
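The effect of the duplicate entry on the union can be checked with a small Python sketch. It uses the daniel and harry rows exactly as written in the question, where group15 appears twice for harry. The variable names are illustrative only.

```python
# Rows as given in the question; harry lists group15 twice.
daniel = "group1,group2,group3,group4,group5,group8,group10".split(",")
harry = "group1,group10,group15,group13,group15,group18".split(",")

# Groups shared by both authors.
intersection = len(set(daniel) & set(harry))

# Inclusion-exclusion on the raw list lengths counts the duplicate twice.
raw_union = len(daniel) + len(harry) - intersection

# A true set union counts each group once.
distinct_union = len(set(daniel) | set(harry))

print(intersection, raw_union, distinct_union)
```

The raw count matches the 11 in the question's expected output, while the distinct count is what a set-based union returns.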

Answer 2: (score: 0)

I suggest this option, which scales better:

WITH YourTable AS (
  SELECT 'daniel' AS author, 'group1,group2,group3,group4,group5,group8,group10' AS grp UNION ALL
  SELECT 'adam' AS author, 'group2,group5,group11,group12' AS grp UNION ALL
  SELECT 'harry' AS author, 'group1,group10,group13,group15,group18' AS grp
),

tempTable AS (
  SELECT author, grp
  FROM YourTable, UNNEST(SPLIT(grp)) as grp
),

intersection AS (
  SELECT a.author AS author1, b.author AS author2, COUNT(1) as intersection
  FROM tempTable a 
  JOIN tempTable b
  USING (grp)
  WHERE a.author > b.author
  GROUP BY a.author, b.author
),

count_distinct_groups AS (
  SELECT author, COUNT(DISTINCT grp) as count_distinct_groups
  FROM tempTable
  GROUP BY author
),

join_it AS (
  SELECT
    intersection.*, cg1.count_distinct_groups AS count_distinct_groups1, cg2.count_distinct_groups AS count_distinct_groups2
  FROM
    intersection
  JOIN
    count_distinct_groups cg1
  ON
    intersection.author1 = cg1.author
  JOIN
    count_distinct_groups cg2
  ON
    intersection.author2 = cg2.author
)

SELECT
  *,
  count_distinct_groups1 + count_distinct_groups2 - intersection AS unionn,
  intersection / (count_distinct_groups1 + count_distinct_groups2 - intersection) AS jaccard
FROM
  join_it

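On big data (tens of thousands x millions), the full cross join fails because of excessive shuffling, and the second suggestion takes hours to execute. This one takes minutes.

A consequence of this approach is that pairs with no intersection do not appear in the result, so whatever process consumes it is responsible for handling them, for example with IFNULL.

One last detail: the union of daniel and harry is 10, not 11, because group15 is repeated in the original example.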
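To illustrate why pairs with no intersection drop out, and how a consumer might fill them in, here is a rough Python sketch of the same idea. It is an analogy and not the answer's implementation: an inverted index from group to authors plays the role of the JOIN ... USING (grp), and dict.get(..., 0) plays the role of IFNULL. All names are hypothetical, and pairs are ordered with the alphabetically smaller author first, unlike the a.author > b.author condition in the query.

```python
from collections import defaultdict
from itertools import combinations

rows = {
    "daniel": "group1,group2,group3,group4,group5,group8,group10",
    "adam": "group2,group5,group11,group12",
    "harry": "group1,group10,group13,group15,group18",
}

# Inverted index: group -> authors in that group (like joining on grp).
index = defaultdict(set)
for author, grp in rows.items():
    for g in set(grp.split(",")):
        index[g].add(author)

# COUNT(DISTINCT grp) per author.
distinct = {author: len(set(grp.split(","))) for author, grp in rows.items()}

# Only pairs that share at least one group ever get an entry here.
intersections = {}
for authors in index.values():
    for pair in combinations(sorted(authors), 2):
        intersections[pair] = intersections.get(pair, 0) + 1


def jaccard(a, b):
    inter = intersections.get((a, b), 0)  # plays the role of IFNULL(intersection, 0)
    union = distinct[a] + distinct[b] - inter
    return inter / union


# Enumerating all pairs explicitly restores the ones with no overlap.
scores = {pair: jaccard(*pair) for pair in combinations(sorted(rows), 2)}
for pair, score in scores.items():
    print(pair, round(score, 3))
```

Here the adam/harry pair never enters intersections because the two authors share no group. It only shows up in scores because the final loop enumerates every pair and defaults the missing intersection to 0.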