Question

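I have a dataset like the one shown below.

Is it possible to get the desired output by executing a single mongo query, or do I have to run many separate queries? "hours" is the hour of the day of each document.

I should also note that the collection receives roughly 1 million entries per day. There are about 400 distinct uids.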

{ uid: 1000000, from: "aaa", to: "bbb": timestamp: ISODate("2016-02-02T18:42:06.336Z") },
{ uid: 1000000, from: "aaa", to: "bbb": timestamp: ISODate("2016-02-02T18:42:06.336Z") },
{ uid: 1000000, from: "bbb", to: "ccc": timestamp: ISODate("2016-02-02T18:42:06.336Z") },
{ uid: 1000000, from: "bbb", to: "ccc": timestamp: ISODate("2016-02-02T18:42:06.336Z") },
{ uid: 2000000, from: "aaa", to: "bbb": timestamp: ISODate("2016-02-02T18:42:06.336Z") },
{ uid: 2000000, from: "aaa", to: "bbb": timestamp: ISODate("2016-02-02T18:42:06.336Z") },
{ uid: 2000000, from: "aaa", to: "bbb": timestamp: ISODate("2016-02-02T18:42:06.336Z") },
{ uid: 2000000, from: "aaa", to: "bbb": timestamp: ISODate("2016-02-02T18:42:06.336Z") },
{ uid: 3000000, from: "aaa", to: "aaa": timestamp: ISODate("2016-02-02T18:42:06.336Z") },
{ uid: 3000000, from: "bbb", to: "bbb": timestamp: ISODate("2016-02-02T18:42:06.336Z") },
{ uid: 3000000, from: "ccc", to: "ccc": timestamp: ISODate("2016-02-02T18:42:06.336Z") },
{ uid: 3000000, from: "ddd", to: "bbb": timestamp: ISODate("2016-02-02T18:42:06.336Z") },
{ uid: 3000000, from: "eee", to: "eee": timestamp: ISODate("2016-02-02T18:42:06.336Z") }

Answer 1

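It should have been made clearer in your question, but your output sample suggests that you are looking for:

The total count of messages per "uid"
The distinct count of values in "to"
The distinct count of values in "from"
A summary of counts per "hour" for each "uid"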

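This is all possible in a single aggregation statement. It just takes some careful management of the distinct lists, and then some manipulation to map the per-hour results onto each of the 24 hours of the day.

The best approach here is aided by operators introduced in MongoDB 3.2: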

db.collection.aggregate([
    // First group by hour within "uid" and keep distinct "to" and "from"
    { "$group": {
        "_id": {
            "uid": "$uid",
            "time": { "$hour": "$timestamp" }
        },
        "from": { "$addToSet": "$from" },
        "to": { "$addToSet": "$to" },
        "count": { "$sum": 1 }
    }},

    // Roll-up to "uid" and keep each hour in an array
    { "$group": {
        "_id": "$_id.uid",
        "total": { "$sum": "$count" },
        "from": { "$addToSet": "$from" },
        "to": { "$addToSet": "$to" },
        "temp_hours": { 
            "$push": {
                "index": "$_id.time",
                "count": "$count"
            }
        }
     }},

     // Getting distinct "to" and "from" requires a double unwind of arrays
     { "$unwind": "$to" },
     { "$unwind": "$to" },
     { "$unwind": "$from" },
     { "$unwind": "$from" },

     // And then adding back to sets for distinct
     { "$group": {
        "_id": "$_id",
        "total": { "$first": "$total" },
        "from": { "$addToSet": "$from" },
        "to": { "$addToSet": "$to" },
        "temp_hours": { "$first": "$temp_hours" }
     }},

     // Map out for each hour and count size of distinct lists
     { "$project": {
        "count": "$total",
        "from_count": { "$size": "$from" },
        "to_count": { "$size": "$to" },
        "hours": {
            "$map": {
                "input": [
                     0,1,2,3,4,5,6,7,8,9,10,11,
                     12,13,14,15,16,17,18,19,20,21,22,23
                 ],
                 "as": "el",
                 "in": {
                      "$ifNull": [
                          { "$arrayElemAt": [
                              { "$map": {
                                  "input": { "$filter": {
                                     "input": "$temp_hours",
                                     "as": "tmp",
                                     "cond": {
                                         "$eq": [ "$$el", "$$tmp.index" ]
                                     }
                                  }},
                                 "as": "out",
                                 "in": "$$out.count"
                              }},
                              0
                          ]},
                          0
                      ]
                 }
            }
        }
     }},

     // Optionally sort in "uid" order
     { "$sort": { "_id": 1 } }
 ])

Before MongoDB 3.2, you need to get a little more involved to map the array content onto all hours of the day:

db.collection.aggregate([

    // First group by hour within "uid" and keep distinct "to" and "from"
    { "$group": {
        "_id": {
            "uid": "$uid",
            "time": { "$hour": "$timestamp" }
        },
        "from": { "$addToSet": "$from" },
        "to": { "$addToSet": "$to" },
        "count": { "$sum": 1 }
    }},

    // Roll-up to "uid" and keep each hour in an array
    { "$group": {
        "_id": "$_id.uid",
        "total": { "$sum": "$count" },
        "from": { "$addToSet": "$from" },
        "to": { "$addToSet": "$to" },
        "temp_hours": { 
            "$push": {
                "index": "$_id.time",
                "count": "$count"
            }
        }
     }},

     // Getting distinct "to" and "from" requires a double unwind of arrays
     { "$unwind": "$to" },
     { "$unwind": "$to" },
     { "$unwind": "$from" },
     { "$unwind": "$from" },

     // And then adding back to sets for distinct, also adding the indexes array
     { "$group": {
        "_id": "$_id",
        "total": { "$first": "$total" },
        "from": { "$addToSet": "$from" },
        "to": { "$addToSet": "$to" },
        "temp_hours": { "$first": "$temp_hours" },
        "indexes": { "$first": { "$literal": [
                     0,1,2,3,4,5,6,7,8,9,10,11,
                     12,13,14,15,16,17,18,19,20,21,22,23
        ] } }
     }},

     // Denormalize both arrays
     { "$unwind": "$temp_hours" },
     { "$unwind": "$indexes" },

     // Marry up the index entries and keep either the value or 0
     // Note you are normalizing the double unwind to distinct index
     { "$group": {
         "_id": {
             "_id": "$_id",
             "index": "$indexes"
         },
         "total": { "$first": "$total" }, 
         "from": { "$first": "$from" },
         "to": { "$first": "$to" },
         "count": {
             "$max": {
                 "$cond": [
                     { "$eq": [ "$indexes", "$temp_hours.index" ] },
                     "$temp_hours.count",
                     0
                 ]
             }
         }
     }},

     // Sort to keep index order - !!Important!!         
     { "$sort": { "_id": 1 } },

     // Put the hours into the array and get sizes for other results
     { "$group": {
         "_id": "$_id._id",
         "count": { "$first": "$total" },
         "from_count": { "$first": { "$size": "$from" } },
         "to_count": { "$first": { "$size": "$to" } },
         "hours": { "$push": "$count" }
     }},

     // Optionally sort in "uid" order
     { "$sort": { "_id": 1 } }
])

To break that down, both approaches follow the same basic steps. The only real difference is in how the "hours" are mapped onto the 24-hour period.

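In the first aggregation $group stage, the objective is to get a result for every hour present in the data, for each "uid" value. The simple date aggregation operator $hour obtains this value as part of the grouping key.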
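As a rough illustration of what that first stage computes, here is a plain JavaScript sketch (not MongoDB code). The `docs` sample and the `groupByUidAndHour` helper are hypothetical names made up for this example, and the sketch assumes the hour is taken in UTC, which is how $hour behaves when no timezone is given.

```javascript
// Hypothetical in-memory sample shaped like the documents in the question.
const docs = [
  { uid: 1000000, from: "aaa", to: "bbb", timestamp: new Date("2016-02-02T18:42:06.336Z") },
  { uid: 1000000, from: "bbb", to: "ccc", timestamp: new Date("2016-02-02T18:42:06.336Z") },
  { uid: 2000000, from: "aaa", to: "bbb", timestamp: new Date("2016-02-02T03:10:00.000Z") }
];

// Mimics: { $group: { _id: { uid, time: { $hour } }, from/to: $addToSet, count: { $sum: 1 } } }
function groupByUidAndHour(input) {
  const groups = new Map();
  for (const doc of input) {
    // Use the UTC hour rather than the local hour (assumption stated above)
    const hour = doc.timestamp.getUTCHours();
    const key = `${doc.uid}|${hour}`;
    if (!groups.has(key)) {
      groups.set(key, {
        _id: { uid: doc.uid, time: hour },
        from: new Set(),
        to: new Set(),
        count: 0
      });
    }
    const group = groups.get(key);
    group.from.add(doc.from); // plays the role of $addToSet
    group.to.add(doc.to);
    group.count += 1;         // plays the role of { $sum: 1 }
  }
  // Convert the sets back to arrays, as the aggregation output would hold them
  return [...groups.values()].map(g => ({ ...g, from: [...g.from], to: [...g.to] }));
}

const perHour = groupByUidAndHour(docs);
console.log(JSON.stringify(perHour, null, 2));
```

Each element of `perHour` corresponds to one document leaving the first $group stage: one per combination of "uid" and hour that actually occurs in the data.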

The $addToSet operations are a sort of "mini-group" in themselves. They keep the "distinct sets" for each of the "to" and "from" values while still grouping per hour.

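The next $group is more "organizational": the recorded "counts" for each hour are kept in an array while all the data is rolled up per "uid". This basically gives you all the "data" you really need for the result, but of course the $addToSet operations here only add "arrays within arrays" of the distinct sets determined per hour.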

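To get these values as truly distinct lists per "uid" and nothing else, it is necessary to deconstruct each array using $unwind (twice, because the arrays are nested) and then finally group back into distinct "sets". The same $addToSet compacts this, and the $first operations just take the "first" values of the other fields, which are already identical across the "per uid" data. We are happy with those, so they are kept as they are.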
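The effect of the double $unwind followed by $addToSet can be sketched in plain JavaScript. The `rolledUp` document and the `distinctFromNested` helper are hypothetical and only illustrate the shape of the data at that point in the pipeline.

```javascript
// Hypothetical document as it might look after the second $group:
// "to" and "from" are arrays of per-hour arrays.
const rolledUp = {
  _id: 1000000,
  to: [["bbb", "ccc"], ["bbb"]],
  from: [["aaa", "bbb"], ["aaa"]]
};

// flat() stands in for the two $unwind stages (outer array, then inner array);
// the Set stands in for the final $addToSet.
function distinctFromNested(nested) {
  return [...new Set(nested.flat())];
}

const distinctTo = distinctFromNested(rolledUp.to);
const distinctFrom = distinctFromNested(rolledUp.from);
console.log(distinctTo.length, distinctFrom.length);
```

Note that $addToSet does not guarantee any ordering of the resulting array, so only the membership and the length of these lists should be relied upon.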

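The final stages here are essentially "cosmetic" in nature and could equally be done in client-side code. Since data is not present for every hour interval, the counts need to be mapped into an array of values representing each hour. This is where the two approaches differ, according to the operators available in each version.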

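In the MongoDB 3.2 release, the $filter and $arrayElemAt operators let you build the logic to "transpose" an input source of all possible index positions (24 hours) onto the counts already determined for the hours present in the data. This is basically a "direct lookup" of the values recorded for each available hour: where an hour exists, its count is transposed into the full array, and where it does not, a default value of 0 is used in its place.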
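The same lookup can be written in plain JavaScript, which may also serve as a starting point if you prefer to do this step in client code. This is a sketch only: `recordedHours` is a hypothetical stand-in for the "temp_hours" array, and each step is annotated with the aggregation operator it mirrors.

```javascript
// Hypothetical "temp_hours" content for one uid: only hours with data appear.
const recordedHours = [
  { index: 18, count: 4 },
  { index: 3, count: 1 }
];

// All 24 possible index positions, 0 through 23
const allHours = Array.from({ length: 24 }, (_, i) => i);

const hours = allHours.map(el => {
  const matches = recordedHours
    .filter(tmp => tmp.index === el) // mirrors $filter with $eq
    .map(out => out.count);          // mirrors the inner $map
  const first = matches[0];          // mirrors $arrayElemAt at position 0
  return first === undefined ? 0 : first; // mirrors $ifNull with default 0
});

console.log(hours);
```

Because the outer map walks the index positions in order, the resulting array is already ordered by hour and no separate sort is needed.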

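Without those operators, doing this "match up" essentially means de-normalizing both arrays (the recorded data and the full 24 positions) in order to compare and transpose them. This is what happens in the second approach, with a simple comparison of the "index" values to see whether there was a result for that hour. The $max operator is needed mainly because of the two $unwind statements, where each recorded value from the source data is reproduced for every possible index position. It "compacts" the result down to just the values that are wanted per "index hour".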
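To see why $max is needed after the two $unwind stages, here is a plain JavaScript sketch of the same cross product. The names `recorded`, `indexes`, `rows` and `paddedHours` are hypothetical and exist only for this illustration.

```javascript
// Hypothetical "temp_hours" content for one uid
const recorded = [
  { index: 18, count: 4 },
  { index: 3, count: 1 }
];
const indexes = Array.from({ length: 24 }, (_, i) => i);

// The two $unwind stages produce one row per (recorded hour, index) pair,
// so 2 recorded hours x 24 indexes = 48 rows here.
const rows = [];
for (const temp of recorded) {
  for (const index of indexes) {
    rows.push({ index, temp });
  }
}

// Grouping on the index: the $cond yields the count on a match and 0 otherwise,
// and $max keeps the count even though the non-matching rows contribute 0.
const paddedHours = new Array(24).fill(0);
for (const row of rows) {
  const value = row.index === row.temp.index ? row.temp.count : 0;
  paddedHours[row.index] = Math.max(paddedHours[row.index], value);
}

console.log(rows.length, paddedHours);
```

Writing into `paddedHours[row.index]` sidesteps ordering in this sketch. The pipeline has no such positional write, which is why it has to sort on the index before pushing values back into an array.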

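In that latter approach it then becomes important to $sort on the grouping _id value. This is because it contains the "index" position, which is needed when moving the content back into an array that you expect to be ordered. That of course happens in the final $group stage, where the ordered positions are put into an array with $push.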

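Back to the "distinct lists": the $size operator is used in all cases to determine the "length", and therefore the "count", of distinct values in the lists for "to" and "from". This is the only real constraint that ties the pipeline to MongoDB 2.6 at least, but it can otherwise be replaced by simply "unwinding" each array individually and then grouping back on the _id already present, in order to count the array entries in each set. It is a basic process, but as you should see, the $size operator is the better option for overall performance.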
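A sketch of that alternative is shown below, assuming it is placed directly after the stage that regroups the distinct "to" and "from" sets per "uid". The variable name `countWithoutSize` is hypothetical; the stages would be spliced into the pipeline at that point, and the later stages would then reference "$to_count" and "$from_count" instead of applying $size.

```javascript
// Hypothetical replacement for the $size expressions: count the distinct
// entries by unwinding each set on its own and summing per _id.
const countWithoutSize = [
  // One document per distinct "to" value
  { "$unwind": "$to" },
  { "$group": {
      "_id": "$_id",
      "total": { "$first": "$total" },
      "from": { "$first": "$from" },
      "temp_hours": { "$first": "$temp_hours" },
      "to_count": { "$sum": 1 }
  }},

  // One document per distinct "from" value
  { "$unwind": "$from" },
  { "$group": {
      "_id": "$_id",
      "total": { "$first": "$total" },
      "temp_hours": { "$first": "$temp_hours" },
      "to_count": { "$first": "$to_count" },
      "from_count": { "$sum": 1 }
  }}
];

console.log(JSON.stringify(countWithoutSize, null, 2));
```

The arrays are unwound one at a time on purpose. Unwinding both before grouping would multiply the rows and inflate both counts.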

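As a final note, your expected output is slightly off, possibly because the entry with "ddd" in "from" was intended to have the same value in "to" but is recorded as "bbb" instead. That reduces the distinct "to" count for the third "uid" grouping by one entry. But of course the logical results given the source data are sound: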

{ "_id" : 1000000, "count" : 4, "from_count" : 2, "to_count" : 2, "hours" : [ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 4, 0, 0, 0, 0, 0 ] }
{ "_id" : 2000000, "count" : 4, "from_count" : 1, "to_count" : 1, "hours" : [ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 4, 0, 0, 0, 0, 0 ] }
{ "_id" : 3000000, "count" : 5, "from_count" : 5, "to_count" : 4, "hours" : [ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 5, 0, 0, 0, 0, 0 ] }

N.B. The source data also has a typo: on every line, the delimiter just before the timestamp field is a : instead of a comma.

MongoDB aggregate query, or too complex?
