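BigQuery: GROUP BY the top n categories and group the rest into "other"

Time: 2015-09-01 05:22:43

Tags: google-bigquery

I keep running into the same task: aggregate data by the top X values of a categorical variable, then roll everything else up into "other".

So far I have been using this trick: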

SELECT
year,
if(tt.state is null, "other", t.state) as state_filtered,
count(1) as children
FROM [publicdata:samples.natality] as t
LEFT OUTER JOIN (
  SELECT state, count(1) as children FROM [publicdata:samples.natality]
  WHERE state is not null
  GROUP BY state
  ORDER BY children DESC
  LIMIT 5
) as tt ON tt.state=t.state
GROUP BY year, state_filtered
ORDER BY year, state_filtered

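But it is not very clean, because I query the same table twice, and in real life the code gets too complicated. I have been looking for a solution that uses ROLLUP or TOP, but have not found anything better.

Does anyone know a better way?

3 Answers:

Answer 0: (score: 3)

You can use ROW_NUMBER in a subquery.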

SELECT
  IF (RNB<=5, state, "Other") AS state,
  SUM(children) AS Children
FROM (
  SELECT
    state,
    children,
    ROW_NUMBER() OVER (ORDER BY children DESC) AS RNB
  FROM (
    SELECT
      state,
      COUNT(1) AS children,
    FROM
      [publicdata:samples.natality]
    WHERE
      state IS NOT NULL
    GROUP BY
      state))
GROUP EACH BY
  state
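
For readers who want to try the ranking idea without BigQuery access, here is a minimal sketch that runs the same shape of query against an in-memory SQLite database from Python. The toy table, its rows, and the names `state_bucket`, `total`, and `TOP_N` are assumptions made for illustration, and SQLite's dialect differs from BigQuery legacy SQL (`CASE` instead of `IF`, no `GROUP EACH BY`). It needs SQLite 3.25 or newer for window functions.

```python
import sqlite3

# Toy stand-in for the natality table: one row per birth (hypothetical data).
rows = (
    [("CA",)] * 5 + [("TX",)] * 4 + [("NY",)] * 3 + [("FL",)] * 2 + [("WA",)] * 1
)

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE natality (state TEXT)")
conn.executemany("INSERT INTO natality VALUES (?)", rows)

TOP_N = 2  # keep the two largest states, roll the rest into 'Other'

query = """
SELECT
  CASE WHEN rnb <= ? THEN state ELSE 'Other' END AS state_bucket,
  SUM(children) AS total
FROM (
  -- rank the per-state counts, largest first
  SELECT
    state,
    children,
    ROW_NUMBER() OVER (ORDER BY children DESC) AS rnb
  FROM (
    -- one row per state with its birth count
    SELECT state, COUNT(*) AS children
    FROM natality
    WHERE state IS NOT NULL
    GROUP BY state
  )
)
GROUP BY state_bucket
ORDER BY total DESC
"""

result = conn.execute(query, (TOP_N,)).fetchall()
print(result)
```

With this toy data the two largest states keep their own rows and the remaining three are summed into a single 'Other' row. Like the answer above, the sketch ranks states globally and does not break the result down by year.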

Answer 1: (score: 3)

I think a single subselect is enough:

SELECT 
  year,
  IF (pos <= 5, state, "other") AS state,
  SUM(children) AS children
FROM (
  SELECT
    year,
    state,
    ROW_NUMBER() OVER (PARTITION BY year ORDER BY children DESC) AS pos,
    COUNT(1) AS children,
  FROM
    [publicdata:samples.natality]
  WHERE
    state IS NOT NULL
  GROUP BY
    year, state
)
GROUP BY year, state
ORDER BY year, state
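
One thing to check before adopting this query: `PARTITION BY year` ranks states within each year, so the set of "top" states can change from year to year, whereas the query in the question picks one global top 5. The following plain-Python sketch illustrates the difference on made-up data; the records, `TOP_N`, and both helper functions are hypothetical and only mimic the grouping logic, not BigQuery itself.

```python
from collections import Counter, defaultdict

# Hypothetical (year, state) records, one per birth.
births = (
    [(2000, "CA")] * 5 + [(2000, "TX")] * 1 + [(2000, "NY")] * 2
    + [(2001, "CA")] * 2 + [(2001, "TX")] * 1 + [(2001, "NY")] * 4
)
TOP_N = 1

counts = Counter(births)  # (year, state) -> number of births


def bucket_global(counts, n):
    """Rank states once over all years, as the join in the question does."""
    totals = Counter()
    for (_, state), c in counts.items():
        totals[state] += c
    top = {s for s, _ in totals.most_common(n)}
    out = Counter()
    for (year, state), c in counts.items():
        out[(year, state if state in top else "other")] += c
    return dict(out)


def bucket_per_year(counts, n):
    """Rank states separately inside each year, like PARTITION BY year."""
    by_year = defaultdict(Counter)
    for (year, state), c in counts.items():
        by_year[year][state] += c
    out = Counter()
    for year, states in by_year.items():
        top = {s for s, _ in states.most_common(n)}
        for state, c in states.items():
            out[(year, state if state in top else "other")] += c
    return dict(out)


global_result = bucket_global(counts, TOP_N)
per_year_result = bucket_per_year(counts, TOP_N)
print(global_result)
print(per_year_result)
```

In this toy data CA is the largest state overall, but NY is the largest in 2001, so the two approaches label the 2001 rows differently. Which behavior is wanted depends on the real use case.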

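Answer 2: (score: 2)

I think there is a "shortcut" solution that gives you the global top 5 states without a join, so, at least code-wise, it does only one scan. It runs about twice as fast as the original code currently in use. I am not sure you will like it; that depends on your real scenario.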

SELECT
  year, 
  state, 
  SUM(children) as children
FROM (
  SELECT
    state,
    REGEXP_EXTRACT(year_info, r'^(\w+)') as year,
    INTEGER(REGEXP_EXTRACT(year_info, r'(\w+)$')) as children,
  FROM (
    SELECT
      CASE WHEN pos < 6 THEN state ELSE 'other' END state,
      SPLIT(years_list) as year_info
    FROM (
      SELECT 
        state,
        GROUP_CONCAT(STRING(year) + '|' + STRING(rows)) as years_list,
        ROW_NUMBER() OVER(ORDER BY children DESC) as pos,
        SUM(rows) as children
      FROM (
        SELECT year, state, COUNT(1) AS rows
        FROM [publicdata:samples.natality]
        WHERE state IS NOT NULL
        GROUP BY year, state
      )    
      GROUP BY state
    )
  )
)
GROUP BY year, state
ORDER BY year, state

I feel there is a better way to handle the "group_concat/split" trick.