Group by for each row in BigQuery

Time: 2018-12-03 12:53:55

Tags: sql group-by google-bigquery

I have a table that stores user comments for each month. Comments are stored with UTC timestamps, and I want to get the users who post more than 20 comments per day. I am able to get the start and end timestamps for each day, but I can't group the comments table by number of comments. This is the script I have for getting dates, timestamps and distinct users:

SELECT
DATE(TIMESTAMP_SECONDS(r.ts_start)) AS date,
r.ts_start AS timestamp_start,
r.ts_start+86400 AS timestamp_end,
COUNT(*) AS number_of_comments,
COUNT(DISTINCT s.author) AS distinct_authors
FROM ((
  WITH
    shifts AS (
    SELECT
      [STRUCT(" 00:00:00 UTC" AS hrs,
        GENERATE_DATE_ARRAY('2018-07-01','2018-07-31', INTERVAL 1 DAY) AS dt_range) ] AS full_timestamps )
  SELECT
    UNIX_SECONDS(CAST(CONCAT( CAST(dt AS STRING), CAST(hrs AS STRING)) AS TIMESTAMP)) AS ts_start,
    UNIX_SECONDS(CAST(CONCAT( CAST(dt AS STRING), CAST(hrs AS STRING)) AS TIMESTAMP)) + 86400 AS ts_end
  FROM
    shifts,
    shifts.full_timestamps
  LEFT JOIN
    full_timestamps.dt_range AS dt)) r
 INNER JOIN
`user_comments.2018_07` s
ON
(s.created_utc BETWEEN r.ts_start
  AND r.ts_end)
GROUP BY
r.ts_start
ORDER BY
number_of_comments DESC 

And this is sample output 1 (image not available).

The user_comments.2018_07 table looks like the following (image not available).

More concretely, I want the first output (output 1) to have one more column showing the number of authors that have more than 20 comments for that date. How can I do that?

1 answer:

Answer 0: (score: 1)

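If the goal is only to get, for each day, the number of users with more than 20 comments in the table user_comments.2018_07, and to add it to the output you have so far, this should simplify the query you first used, as long as you don't need to keep the min/max timestamps for each day.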

with nb_comms_per_day_per_user as (
SELECT
day,
author,
COUNT(*) as nb_comments
FROM
# unnest as we don't really want an array
unnest(GENERATE_DATE_ARRAY('2018-07-01','2018-07-31', INTERVAL 1 DAY)) AS day
INNER JOIN `user_comments.2018_07` c
on
# directly convert timestamp to a date, without using min/max timestamp
date(timestamp_seconds(created_utc))
=
day
GROUP BY day, c.author
)

SELECT
day,
sum(nb_comments) as total_comments,
count(*) as distinct_authors, # we have already grouped by author
# sum + if counts the "very active" users (more than 20 comments that day)
sum(if(nb_comments > 20, 1, 0)) as very_active_users
FROM nb_comms_per_day_per_user
GROUP BY day
ORDER BY total_comments desc

I also assumed that the comment column containing boolean values is not used, since you did not use it in your initial query?
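If the per-day start and end timestamps from output 1 are still wanted, they can probably be derived from the day itself rather than from a generated date range. The following is a minimal sketch, not a tested solution. It assumes that created_utc holds Unix seconds, as in the question, and that the monthly table only contains July 2018 rows. The very_active_users alias is reused from the query above, and days with no comments will not appear in the result.

```sql
# Sketch only: assumes created_utc is in Unix seconds and the table
# contains only rows for the month of interest.
WITH nb_comms_per_day_per_user AS (
  SELECT
    DATE(TIMESTAMP_SECONDS(created_utc)) AS day,  # UTC calendar day
    author,
    COUNT(*) AS nb_comments
  FROM `user_comments.2018_07`
  GROUP BY day, author
)
SELECT
  day AS date,
  # midnight UTC of the day, as Unix seconds
  UNIX_SECONDS(TIMESTAMP(day)) AS timestamp_start,
  UNIX_SECONDS(TIMESTAMP(day)) + 86400 AS timestamp_end,
  SUM(nb_comments) AS number_of_comments,
  COUNT(*) AS distinct_authors,  # one row per (day, author)
  # authors with more than 20 comments on that day
  COUNTIF(nb_comments > 20) AS very_active_users
FROM nb_comms_per_day_per_user
GROUP BY day
ORDER BY number_of_comments DESC
```

Grouping on the converted date also avoids the inclusive BETWEEN in the original join, which would count a comment posted exactly at midnight in two consecutive days.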