How to count the frequency of elements in a BigQuery array field

Date: 2018-01-23 21:44:39

Tags: google-bigquery standard-sql

I have a table that looks like this:

[image: screenshot of the table, with columns author_id, year, l_0, l_1, l_2, l_3]

I am looking for a table that lists the frequency counts of the elements in the l_0, l_1, l_2, l_3 fields.

For example, the output should look like this:

| author_id  | year | l_0.name         | l_0.count | l_1.name   | l_1.count | l_2.name            | l_2.count | l_3.name           | l_3.count |
| 2164089123 | 1987 | biology          | 3         | botany     | 3         |                     |           |                    |           |
| 2595831531 | 1987 | computer science | 2         | simulation | 2         | computer simulation | 2         | mathematical model | 2         |

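Edit

In some cases, an array field may contain more than one distinct element. For example, l_0 could be ['biology', 'biology', 'geometry', 'geometry']. In that case, the output for the fields l_0, l_1, l_2, l_3 would be nested repeated fields, with all the elements in l_0.name and all the corresponding counts in l_0.count.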

1 answer:

Answer 0: (score: 2)

This should work, assuming you want the count on a per-array basis and each array holds a single distinct value:

SELECT
  author_id,
  year,
  (SELECT AS STRUCT ANY_VALUE(l_0) AS name, COUNT(*) AS count
   FROM UNNEST(l_0) AS l_0) AS l_0,
  (SELECT AS STRUCT ANY_VALUE(l_1) AS name, COUNT(*) AS count
   FROM UNNEST(l_1) AS l_1) AS l_1,
  (SELECT AS STRUCT ANY_VALUE(l_2) AS name, COUNT(*) AS count
   FROM UNNEST(l_2) AS l_2) AS l_2,
  (SELECT AS STRUCT ANY_VALUE(l_3) AS name, COUNT(*) AS count
   FROM UNNEST(l_3) AS l_3) AS l_3
FROM YourTable;

To avoid so much repetition, you can use a SQL UDF:

CREATE TEMP FUNCTION GetNameAndCount(elements ARRAY<STRING>) AS (
  (SELECT AS STRUCT ANY_VALUE(elem) AS name, COUNT(*) AS count
   FROM UNNEST(elements) AS elem)
);

SELECT
  author_id,
  year,
  GetNameAndCount(l_0) AS l_0,
  GetNameAndCount(l_1) AS l_1,
  GetNameAndCount(l_2) AS l_2,
  GetNameAndCount(l_3) AS l_3
FROM YourTable;

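If you may need to account for multiple distinct names within an array, you can have the UDF return an array of names with their associated counts instead: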

CREATE TEMP FUNCTION GetNamesAndCounts(elements ARRAY<STRING>) AS (
  ARRAY(
    SELECT AS STRUCT elem AS name, COUNT(*) AS count
    FROM UNNEST(elements) AS elem
    GROUP BY elem
    ORDER BY count
  )
);

SELECT
  author_id,
  year,
  GetNamesAndCounts(l_0) AS l_0,
  GetNamesAndCounts(l_1) AS l_1,
  GetNamesAndCounts(l_2) AS l_2,
  GetNamesAndCounts(l_3) AS l_3
FROM YourTable;
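
To try the array-returning UDF without a real table, here is a minimal sketch. SampleTable and its single row are made up for illustration (the author_id and year are borrowed from the question's example output), and the ORDER BY count DESC, name is a variation on the answer so that the most frequent name comes first:

```sql
-- Same UDF as above, but sorted by descending count (a variation, not the answer's exact code).
CREATE TEMP FUNCTION GetNamesAndCounts(elements ARRAY<STRING>) AS (
  ARRAY(
    SELECT AS STRUCT elem AS name, COUNT(*) AS count
    FROM UNNEST(elements) AS elem
    GROUP BY elem
    ORDER BY count DESC, name
  )
);

-- SampleTable is hypothetical inline data standing in for YourTable.
WITH SampleTable AS (
  SELECT
    2164089123 AS author_id,
    1987 AS year,
    ['biology', 'biology', 'geometry', 'geometry', 'biology'] AS l_0
)
SELECT
  author_id,
  year,
  GetNamesAndCounts(l_0) AS l_0
FROM SampleTable;
```

For this sample row, l_0 should come back as a repeated record with two entries: biology with a count of 3, followed by geometry with a count of 2.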

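Note that if you want to perform counts across rows, you will need to unnest the arrays and perform the GROUP BY at the outer level, but based on the question that does not look like your intention.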
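
For the cross-row case, a rough sketch might look like the following. It assumes the same YourTable, handles only l_0, and groups by author_id across all years; that grouping key is an assumption chosen for illustration, not something the question specifies:

```sql
-- Hypothetical cross-row count: one output row per (author_id, name).
SELECT
  author_id,
  name,
  COUNT(*) AS count
FROM YourTable
CROSS JOIN UNNEST(l_0) AS name  -- produces one row per array element
GROUP BY author_id, name
ORDER BY author_id, count DESC;
```

With CROSS JOIN, rows whose l_0 is empty or NULL drop out of the result. Keeping them would require LEFT JOIN UNNEST(l_0) and COUNT(name) instead of COUNT(*), so that the placeholder NULL element is not counted.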