BigQuery: joining tables on repeated fields

Time: 2018-12-04 17:00:20

Tags: google-bigquery

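I have a table that contains id as a single (non-repeated) column plus several nested (repeated) columns.

1)

Example schema for better understanding:

id - string,

childrenNames - repeated string,

animalNames - repeated string,

A second table contains only single (non-repeated) columns.

2)

Example schema for better understanding:

childrenName - string,

animalName - string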

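I need to find all records in table 2) that are not in table 1). For a record to count as present, its childrenName and animalName must both belong to the same user (the same id).

I should add that I tried checking each column of table 2) separately against an 'IN' list built from table 1). But if that returns any rows, it may also mean that the two values belong to two (or more) different ids.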
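The pitfall described above can be reproduced with a small sketch. The second row of table 1) (id '5678' with 'Tom' and 'Ozzy') is hypothetical data added only for illustration, and the table names in the WITH clause are placeholders.

```sql
#standardSQL
-- Illustration only: two independent IN checks accept a pair
-- whose halves come from different ids.
WITH table1 AS (
  SELECT '1234' id, ['Ana', 'Frank'] childrenNames, ['Rex', 'Max'] animalNames UNION ALL
  SELECT '5678', ['Tom'], ['Ozzy']  -- hypothetical second user
), table2 AS (
  SELECT 'Ana' childrenName, 'Ozzy' animalName
)
SELECT childrenName, animalName
FROM table2
WHERE childrenName IN (SELECT cn FROM table1, UNNEST(childrenNames) cn)
  AND animalName IN (SELECT an FROM table1, UNNEST(animalNames) an)
```

This returns the pair ('Ana', 'Ozzy') even though no single id holds both names, which is why the two columns have to be checked against the same row of table 1).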

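Sample row of table 1)

id:1234,

childrenNames:['Ana','Frank'],

animalNames:['Rex','Max'],

Sample rows of table 2)

A)

childrenName:'Ana',

animalName:'Ozzy'

B)

childrenName:'Frank',

animalName:'Rex'

For the example above, I should get row A) from table 2), because 'Ozzy' does not belong to id 1234 (assuming table 1) has no other records).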

Does anyone know how to solve this kind of problem with BigQuery SQL (standard or legacy)?

1 Answer:

Answer 0: (score: 1)

The following is for BigQuery Standard SQL

#standardSQL
SELECT childrenName, animalName, ARRAY_AGG(DISTINCT id) users
FROM `project.dataset.table2`
CROSS JOIN `project.dataset.table1`
WHERE (SELECT COUNT(1) FROM UNNEST(childrenNames) cn WHERE cn = childrenName) > 0
AND (SELECT COUNT(1) FROM UNNEST(animalNames) an WHERE an = animalName) > 0
GROUP BY childrenName, animalName

You can test it with the sample data from the question, as follows

#standardSQL
WITH `project.dataset.table1` AS (
  SELECT '1' id, ['Ana', 'Frank'] childrenNames,  ['Rex', 'Max'] animalNames 
), `project.dataset.table2` AS (
  SELECT 'Ana' childrenName, 'Ozzy' animalName UNION ALL
  SELECT 'Frank', 'Rex'
)
SELECT childrenName, animalName, ARRAY_AGG(DISTINCT id) users
FROM `project.dataset.table2`
CROSS JOIN `project.dataset.table1`
WHERE (SELECT COUNT(1) FROM UNNEST(childrenNames) cn WHERE cn = childrenName) > 0
AND (SELECT COUNT(1) FROM UNNEST(animalNames) an WHERE an = animalName) > 0
GROUP BY childrenName, animalName

with the result

Row childrenName    animalName  users    
1   Frank           Rex         1     

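Note: the users field in the output is a repeated string (an array) holding the list of users who have the searched pair

A less verbose variation of the above would be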

#standardSQL
SELECT childrenName, animalName, ARRAY_AGG(DISTINCT id) users
FROM `project.dataset.table2`
CROSS JOIN `project.dataset.table1`
WHERE childrenName IN UNNEST(childrenNames)
AND animalName IN UNNEST(animalNames)
GROUP BY childrenName, animalName

with exactly the same result

So, obviously, use the second one :o)
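The queries above return the pairs from table 2) that do match some user. The question asks for the opposite set: pairs that no single id in table 1) holds. One possible sketch, which is not part of the original answer, is to flatten table 1) into (childrenName, animalName) pairs and subtract them with EXCEPT DISTINCT. The aliases cn and an are illustrative.

```sql
#standardSQL
WITH `project.dataset.table1` AS (
  SELECT '1234' id, ['Ana', 'Frank'] childrenNames, ['Rex', 'Max'] animalNames
), `project.dataset.table2` AS (
  SELECT 'Ana' childrenName, 'Ozzy' animalName UNION ALL
  SELECT 'Frank', 'Rex'
)
-- All pairs present in table 2) ...
SELECT childrenName, animalName
FROM `project.dataset.table2`
EXCEPT DISTINCT
-- ... minus every (child, animal) pair held by one and the same id in table 1)
SELECT cn, an
FROM `project.dataset.table1`,
UNNEST(childrenNames) cn,
UNNEST(animalNames) an
```

With the sample data this leaves only the pair ('Ana', 'Ozzy'), i.e. row A) from the question. Note that EXCEPT DISTINCT also collapses duplicate pairs in table 2).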

  

...table 1) has 5 million records and table 2) has 200k, hence: Query exceeded resource limits

Try the version below

#standardSQL
WITH flatten_table1 AS (
  SELECT id, childrenName, animalName
  FROM `project.dataset.table1`, 
  UNNEST(childrenNames) childrenName,
  UNNEST(animalNames) animalName
)
SELECT childrenName, animalName, id
FROM `project.dataset.table2`
JOIN flatten_table1
USING(childrenName, animalName)
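
The query above returns the matching pairs together with their id. If the goal is the rows of table 2) that have no match under any single id, as the question describes, the same flattening idea can be turned into an anti-join. This is a hedged sketch rather than the answerer's query: the aliases t2 and f are illustrative, and SELECT DISTINCT is an added assumption intended to shrink the flattened side before the join.

```sql
#standardSQL
WITH flatten_table1 AS (
  -- One row per distinct (child, animal) pair that occurs under the same id
  SELECT DISTINCT childrenName, animalName
  FROM `project.dataset.table1`,
  UNNEST(childrenNames) childrenName,
  UNNEST(animalNames) animalName
)
SELECT t2.childrenName, t2.animalName
FROM `project.dataset.table2` t2
LEFT JOIN flatten_table1 f
  ON t2.childrenName = f.childrenName
  AND t2.animalName = f.animalName
-- Keep only rows of table 2) that found no partner in the flattened table 1)
WHERE f.childrenName IS NULL
```

Because the join condition is a plain equality on both columns, it avoids the CROSS JOIN of the earlier versions. Whether it stays within resource limits on 5 million rows depends on how long the arrays are, since flattening multiplies each row by the product of its array lengths.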