Select the column names of the X largest values

Time: 2018-09-27 12:44:40

Tags: sql google-bigquery

I have built a matrix of users and their interactions with product categories. My data looks like the following, where each row is a user, each column is a category, and each number is how many times that user interacted with that category:

    User     Cat1     Cat2     Cat3     Cat4     Cat5     ...
    1        0        1        0        2        30
    2        0        0        10       5        0
    3        0        5        0        0        0
    4        2        0        20       2        0
    5        0        40       0        0        0
    ...

I would like to add a column (either in this query or in a new query over this table) that returns, for each user, the names of the 3 columns with the highest values.

My full data has more than 200 columns.

Any suggestions on how to achieve this in Standard SQL?

This is the code I used to build the grid:

[code not preserved in this copy]

3 Answers:

Answer 0: (score: 1)

The following is for BigQuery Standard SQL. It does not depend on the number of category columns, even though the example has only 5:

#standardSQL
SELECT *, 
  ARRAY_TO_STRING(ARRAY(
    SELECT SPLIT(kv, ':')[OFFSET(0)]
    FROM UNNEST(SPLIT(REGEXP_REPLACE(TO_JSON_STRING(t), r'[{"}]', ''))) kv
    WHERE LOWER(SPLIT(kv, ':')[OFFSET(0)]) <> 'user'
    ORDER BY CAST(SPLIT(kv, ':')[OFFSET(1)] AS INT64) DESC
    LIMIT 3
  ), ',') top3_cat
FROM `yourproject.yourdataset.yourtable` t

You can test and play with the above using the dummy data from the question:

#standardSQL
WITH `project.dataset.table` AS (
  SELECT 1 user, 0 cat1, 1 cat2, 0 cat3, 2 cat4, 30 cat5 UNION ALL
  SELECT 2, 0, 0, 10, 5, 0 UNION ALL
  SELECT 3, 0, 5, 0, 0, 0 UNION ALL
  SELECT 4, 2, 0, 20, 2, 0 UNION ALL
  SELECT 5, 0, 40, 0, 0, 0 
)
SELECT *, 
  ARRAY_TO_STRING(ARRAY(
    SELECT SPLIT(kv, ':')[OFFSET(0)]
    FROM UNNEST(SPLIT(REGEXP_REPLACE(TO_JSON_STRING(t), r'[{"}]', ''))) kv
    WHERE LOWER(SPLIT(kv, ':')[OFFSET(0)]) <> 'user'
    ORDER BY CAST(SPLIT(kv, ':')[OFFSET(1)] AS INT64) DESC
    LIMIT 3
  ), ',') top3_cat
FROM `project.dataset.table` t

with the result:

Row user    cat1    cat2    cat3    cat4    cat5    top3_cat     
1   1       0       1       0       2       30      cat5,cat4,cat2   
2   2       0       0       10      5       0       cat3,cat4,cat2   
3   3       0       5       0       0       0       cat2,cat3,cat1   
4   4       2       0       20      2       0       cat3,cat4,cat1   
5   5       0       40      0       0       0       cat2,cat3,cat1   

I have updated the question with the code that builds the matrix; would you mind showing how to integrate your solution?
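To make the string manipulation in Answer 0 easier to follow, here is a minimal Python sketch of the same idea: serialize a row to JSON, strip the braces and quotes, split into `name:value` pairs, drop the user column, sort by value descending, and keep three names. The function name `top3_columns` and its parameters are hypothetical, chosen only for illustration. The sketch assumes integer values and column names that contain no commas or colons, which is the same assumption the SQL version relies on.

```python
import json

def top3_columns(row, id_column="user", n=3):
    """Mimic the TO_JSON_STRING / SPLIT trick on a single row (a dict)."""
    # Equivalent of TO_JSON_STRING(t): {"user":1,"cat1":0,...}
    text = json.dumps(row, separators=(",", ":"))
    # Equivalent of REGEXP_REPLACE(..., r'[{"}]', ''): user:1,cat1:0,...
    for ch in '{"}':
        text = text.replace(ch, "")
    # Equivalent of UNNEST(SPLIT(...)) followed by SPLIT(kv, ':')
    pairs = [kv.split(":") for kv in text.split(",")]
    candidates = [(name, int(value)) for name, value in pairs
                  if name.lower() != id_column]
    # ORDER BY value DESC LIMIT n; Python's sort is stable, so ties keep column order
    candidates.sort(key=lambda pair: pair[1], reverse=True)
    return ",".join(name for name, _ in candidates[:n])

row = {"user": 1, "cat1": 0, "cat2": 1, "cat3": 0, "cat4": 2, "cat5": 30}
print(top3_columns(row))
```

When several categories share the same count, the order among them is not fixed by the SQL query, so the tied names in the result may differ from run to run. The Python sketch simply keeps the original column order for ties.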

Answer 1: (score: 0)

Expanding on my comment: if your data were in a more sensible format, such as user | category | cat_count, you could do the following:

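-- STRING_AGG is the BigQuery Standard SQL counterpart of GROUP_CONCAT
SELECT user, STRING_AGG(category ORDER BY cat_rank) AS top_3_cat
FROM
    (
        -- rank highest counts first; ties share a rank, so more than 3 rows may qualify
        SELECT user, category, RANK() OVER (PARTITION BY user ORDER BY cat_count DESC) AS cat_rank
        FROM yourtable
    ) cat_ranking
WHERE cat_rank <= 3
GROUP BY user;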

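Given the number of categories stored as columns, doing this in the current schema is close to impossible.

I would focus first on unpivoting the table so that it can be run through the SQL above. This can be done with bigquery's unpivot transform, although I am not sure what the limit on the number of unpivoted columns is.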

unpivot col:cat1, cat2, cat3, cat4, cat5, catN groupEvery:N

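I don't use bigquery, so I am not sure how to apply this to your dataset, but it looks promising.

Another option is to UNION many statements together to build the yourtable used in the SQL above: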

SELECT user, 'cat1' AS category, cat1 AS cat_count FROM yourtable
UNION ALL SELECT user, 'cat2', cat2 FROM yourtable
UNION ALL SELECT user, 'cat3', cat3 FROM yourtable
UNION ALL SELECT user, 'cat4', cat4 FROM yourtable
UNION ALL SELECT user, 'cat5', cat5 FROM yourtable
UNION ALL SELECT user, 'catN', catN FROM yourtable;
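The two steps described in this answer (unpivot the wide table into user | category | cat_count rows, then keep the top categories per user) can be sketched in Python as follows. This is only an illustration of the logic under the assumption that each row is a dict; the helper names `unpivot` and `top_categories` are hypothetical and do not correspond to any BigQuery feature. Unlike `RANK()`, the sketch cuts off at exactly `n` names even when there are ties.

```python
from collections import defaultdict

def unpivot(rows, id_column="user"):
    """Turn wide rows into (user, category, cat_count) tuples."""
    for row in rows:
        for column, value in row.items():
            if column != id_column:
                yield row[id_column], column, value

def top_categories(long_rows, n=3):
    """Group the long rows by user and keep the n categories with the highest counts."""
    by_user = defaultdict(list)
    for user, category, cat_count in long_rows:
        by_user[user].append((category, cat_count))
    result = {}
    for user, pairs in by_user.items():
        # Highest count first; the stable sort keeps column order among ties
        pairs.sort(key=lambda pair: pair[1], reverse=True)
        result[user] = [category for category, _ in pairs[:n]]
    return result

rows = [
    {"user": 1, "cat1": 0, "cat2": 1, "cat3": 0, "cat4": 2, "cat5": 30},
    {"user": 5, "cat1": 0, "cat2": 40, "cat3": 0, "cat4": 0, "cat5": 0},
]
print(top_categories(unpivot(rows)))
```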

Answer 2: (score: 0)

You would use arrays in bigquery:

select t.*,
       (select array_agg(s.colname order by s.val desc limit 3)
        from unnest(array[struct('col1' as colname, col1 as val),
                          struct('col2' as colname, col2 as val),
                          . . .
                         ]
                   ) s
       ) as top3
from t