BigQuery - Filtering on repeated fields using semi-joins

Time: 2016-01-14 17:50:32

Tags: google-bigquery

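I'm trying to select records from one table based on whether an item in a repeated field appears in a column of another table. I have this working when I explicitly list the items I'm testing against in the query, but not when I select them from another table. Let me demonstrate using the trigrams dataset:

Let's say I want to select all records that appear in certain years. I don't want only the data for those years, though; I want all of the data associated with those records. If I only care about a few years, I can do something like this (which works):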

SELECT ngram, first, second, third, fourth, fifth, cell.value, cell.volume_count,
    SOME(cell.value in ('1800', '1801')) WITHIN RECORD AS valid
FROM [publicdata:samples.trigrams]
HAVING valid
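
To make the intended semantics concrete outside BigQuery, here is a minimal Python sketch of what the query above is meant to do: keep a whole record, with all of its cells, if any of its cell values is in the set of years. The `records` list and the `keep_matching_records` helper are hypothetical stand-ins invented for illustration, not real trigrams data or a BigQuery API.

```python
# Toy stand-ins for rows of the trigrams table; the values are invented.
records = [
    {"ngram": "a b c", "cell": [{"value": ["1800"], "volume_count": 3},
                                 {"value": ["1950"], "volume_count": 7}]},
    {"ngram": "d e f", "cell": [{"value": ["1900"], "volume_count": 1}]},
]


def keep_matching_records(records, years):
    """Keep a whole record if any of its cell values is in `years`.

    This mirrors SOME(cell.value IN (...)) WITHIN RECORD followed by
    HAVING valid: the test is per record, and a matching record keeps
    every cell, not only the cells for the matching years.
    """
    years = set(years)
    return [
        rec for rec in records
        if any(v in years for cell in rec["cell"] for v in cell["value"])
    ]


selected = keep_matching_records(records, {"1800", "1801"})
print([rec["ngram"] for rec in selected])
```

In this sketch the set of years could just as well be loaded from another table, which is the behavior I am after in BigQuery.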

However, rather than hard-coding '1800' and '1801' in my query, I have a table, years, that contains the set of years I'm interested in. I expected this to work:

SELECT ngram, first, second, third, fourth, fifth, cell.value, cell.volume_count,
    SOME(cell.value in (SELECT year_as_str FROM [mydataset.years])) WITHIN RECORD AS valid
FROM [publicdata:samples.trigrams]
HAVING valid

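This does not work, because BigQuery requires a semi-join to be part of a WHERE or HAVING clause.

So I tried rearranging the query (going back to the first query, with hard-coded years):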

SELECT ngram, first, second, third, fourth, fifth, cell.value, cell.volume_count
FROM [publicdata:samples.trigrams]
HAVING SOME(cell.value in ('1801', '1802')) WITHIN RECORD

This results in the error: Encountered " "WITHIN" "WITHIN "" ... Was expecting <EOF>

Now without WITHIN RECORD:

SELECT ngram, first, second, third, fourth, fifth, cell.value, cell.volume_count
FROM [publicdata:samples.trigrams]
HAVING SOME(cell.value in ('1801', '1802'))

This results in the error: SELECT clause has mix of aggregations '...' and fields '...' without GROUP BY clause

But I'm not aggregating! So now I move the filter into WHERE:

SELECT ngram, first, second, third, fourth, fifth, cell.value, cell.volume_count
FROM [publicdata:samples.trigrams]
WHERE SOME(cell.value in ('1801', '1802'))

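This tells me: Invalid function name: SOME. What?!

Is there a way to get the behavior I'm looking for with BigQuery?

2 Answers:

Answer 0: (score: 1)

The query below solves your example. I hope you will be able to extend it to your real use case (if you still need a solution):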

SELECT 
    ngram, cell.value, cell.volume_count, 
    cell.volume_fraction, cell.page_count, cell.match_count
FROM [publicdata:samples.trigrams] AS trigrams
JOIN (
  SELECT ngram AS qualified
  FROM (
    FLATTEN((SELECT ngram, cell.value AS value
      FROM (FLATTEN([publicdata:samples.trigrams], cell.value))), value)
  ) AS t
  JOIN [mydataset.years] AS y
  ON y.year_as_str = t.value
  GROUP BY 1
) AS valid
ON valid.qualified = trigrams.ngram

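Note that cell.value in [publicdata:samples.trigrams] is a REPEATED STRING, which is why you see the "extra" FLATTEN.

Answer 1: (score: 0)

You can use the OMIT RECORD IF clause. This requires a double negation, because the clause omits the records that satisfy a condition instead of keeping them. The following query should work: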

SELECT ngram, first, second, third, fourth, fifth, cell.value, cell.volume_count
FROM [publicdata:samples.trigrams]
OMIT RECORD IF EVERY(cell.value NOT IN ('1801', '1802'))
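
The double negation can be checked with a small Python sketch: omitting a record when every value is not in the set gives the same result as keeping it when some value is in the set. The helper names and sample values below are hypothetical, and the sketch does not model how BigQuery treats NULLs.

```python
def some_in(values, years):
    """Keep-style test: True if at least one value is in `years`."""
    return any(v in years for v in values)


def omit_if_every_not_in(values, years):
    """Omit-style test: True (omit the record) if no value is in `years`."""
    return all(v not in years for v in values)


years = {"1801", "1802"}
# Invented lists of cell values, one list per record.
samples = [["1801", "1900"], ["1900", "1950"], ["1802"], []]

for values in samples:
    kept_by_omit = not omit_if_every_not_in(values, years)
    # De Morgan: NOT EVERY(x NOT IN s) is the same as SOME(x IN s).
    assert kept_by_omit == some_in(values, years)
```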