Question

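I have a single collection with over 200 million documents containing dimensions (things I want to filter on or group by) and metrics (things I want to sum or average). I'm currently running into performance issues and I'm hoping to get some advice on how to optimize or scale MongoDB, or suggestions for alternative solutions. I'm running the latest stable MongoDB version with WiredTiger. The documents basically look like this: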

{
  "dimensions": {
    "account_id": ObjectId("590889944befcf34204dbef2"),
    "url": "https://test.com",
    "date": ISODate("2018-03-04T23:00:00.000+0000")
  },
  "metrics": {
    "cost": 155,
    "likes": 200
  }
}

I have three indexes on this collection, because various aggregations are run against it:

account_id
date
account_id and date

The following aggregation query fetches 3 months of data, summing cost and likes and grouping by week/year:

db.large_collection.aggregate(

    [
        {
            $match: { "dimensions.date": { $gte: new Date(1512082800000), $lte: new Date(1522447200000) } }
        },

        {
            $match: { "dimensions.account_id": { $in: [ "590889944befcf34204dbefc", "590889944befcf34204dbf1f", "590889944befcf34204dbf21" ] }}
        },

        {
            $group: { 
              cost: { $sum: "$metrics.cost" }, 
              likes: { $sum: "$metrics.likes" }, 
              _id: { 
                year: { $year: { date: "$dimensions.date", timezone: "Europe/Amsterdam" } }, 
                week: { $isoWeek: { date: "$dimensions.date", timezone: "Europe/Amsterdam" } } 
              } 
            }
        },

        { 
            $project: {
                cost: 1, 
                likes: 1 
            }
        }
    ],

    {
        cursor: {
            batchSize: 50
        },
        allowDiskUse: true
    }

);

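This query takes about 25-30 seconds to complete, and I'd like to bring that down to 5-10 seconds at most. It currently runs on a single MongoDB node, with no sharding or anything. The explain output can be found here: https://pastebin.com/raw/fNnPrZh0 and the executionStats here: https://pastebin.com/raw/WA7BNpgA. As you can see, MongoDB is using indexes, but there are still 1.3 million documents that need to be read. I currently suspect I'm facing an I/O bottleneck.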

Does anyone have an idea how to improve this aggregation pipeline? Would sharding help at all? Is MongoDB the right tool here?

Answer 1

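The following could improve performance if, and only if, precomputing dimensions within each record is an option.

If this type of query represents an important share of the queries on this collection, then adding extra fields to make these queries faster could be a viable alternative.

This hasn't been benchmarked.

One of the costly parts of this query probably comes from working with dates.

First, during the $group stage, the year and the ISO week in a specific time zone have to be computed for each matching record.

Then, to a lesser extent, during the initial filtering, to keep only the dates from the last 3 months.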

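The idea would be to store the year and the ISO week in each record; for the given example this would be { "year" : 2018, "week" : 10 }. This way the _id key in the $group stage wouldn't need any computation (which would otherwise represent about 1.3M complex date operations).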
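
As a rough sketch of how those two fields could be computed in application code at insert time (the helper names below are hypothetical, and this assumes a JavaScript runtime with full time zone data, such as a recent Node.js):

```javascript
// Hypothetical helper, not part of the original answer: returns the calendar
// date (year, month, day) that `date` falls on in the given IANA time zone.
function localDateParts(date, timeZone) {
  const parts = new Intl.DateTimeFormat("en-US", {
    timeZone,
    year: "numeric",
    month: "numeric",
    day: "numeric",
  }).formatToParts(date);
  const get = (type) => Number(parts.find((p) => p.type === type).value);
  return { year: get("year"), month: get("month"), day: get("day") };
}

// ISO 8601 week number: a week belongs to the year that contains its Thursday.
function isoWeek({ year, month, day }) {
  const DAY_MS = 24 * 60 * 60 * 1000;
  const local = new Date(Date.UTC(year, month - 1, day));
  const daysSinceMonday = (local.getUTCDay() + 6) % 7; // Monday = 0
  const thursday = new Date(local.getTime() + (3 - daysSinceMonday) * DAY_MS);
  const jan1 = Date.UTC(thursday.getUTCFullYear(), 0, 1);
  return Math.floor((thursday.getTime() - jan1) / (7 * DAY_MS)) + 1;
}

// Calendar year plus ISO week, mirroring the $year / $isoWeek pair of the query.
function precomputeYearWeek(date, timeZone) {
  const parts = localDateParts(date, timeZone);
  return { year: parts.year, week: isoWeek(parts) };
}

const prec = precomputeYearWeek(
  new Date("2018-03-04T23:00:00.000Z"),
  "Europe/Amsterdam"
);
console.log(prec); // { year: 2018, week: 10 }
```

Note that pairing the calendar year with the ISO week, as the original query does, puts the last days of December into "week 1" of the same calendar year in some years. If that matters, MongoDB's $isoWeekYear may be a better grouping key than $year.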

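In a similar fashion, we could also store the associated month in each record, which would be { "month" : "201803" } for the given example. This way the first match could be on months [2, 3, 4, 5], before applying a more precise and costlier filter on the exact timestamps. This would replace the initial, costlier Date filtering on 200M records with a simple Int filtering.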
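
To illustrate, here is a minimal sketch of deriving the coarse month bounds from an exact date range, so that the two filters stay consistent. The helper names are hypothetical, and it assumes the month key is a zero-padded "YYYYMM" string stored under prec.month, so that string comparison follows chronological order:

```javascript
// Hypothetical helper: "YYYYMM" key for the month `date` falls in, in the given time zone.
function monthKey(date, timeZone) {
  const parts = new Intl.DateTimeFormat("en-US", {
    timeZone,
    year: "numeric",
    month: "2-digit",
  }).formatToParts(date);
  const get = (type) => parts.find((p) => p.type === type).value;
  return get("year") + get("month");
}

// Builds the coarse $match stage covering every month touched by [from, to].
function coarseMonthMatch(from, to, timeZone) {
  return {
    $match: {
      "prec.month": { $gte: monthKey(from, timeZone), $lte: monthKey(to, timeZone) },
    },
  };
}

// Using the exact timestamps from the question's query:
const stage = coarseMonthMatch(
  new Date(1512082800000),
  new Date(1522447200000),
  "Europe/Amsterdam"
);
console.log(JSON.stringify(stage));
// {"$match":{"prec.month":{"$gte":"201712","$lte":"201803"}}}
```

The exact $match on dimensions.date still has to follow, since the first and last months are only partially covered by the range.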

Let's create a new collection with these new precomputed fields (in a real scenario, these fields would be added when the records are first inserted):

db.large_collection.aggregate([
    { $addFields: {
        "prec.year": { $year: { date: "$dimensions.date", timezone: "Europe/Amsterdam" } },
        "prec.week": { $isoWeek: { date: "$dimensions.date", timezone: "Europe/Amsterdam" } },
        "prec.month": { $dateToString: { format: "%Y%m", date: "$dimensions.date", timezone: "Europe/Amsterdam" } }
    }},
    { "$out": "large_collection_precomputed" }
])

which will store these documents:

{ "dimensions" : { "account_id" : ObjectId("590889944befcf34204dbef2"), "url" : "https://test.com", "date" : ISODate("2018-03-04T23:00:00Z") }, "metrics" : { "cost" : 155, "likes" : 200 }, "prec" : { "year" : 2018, "week" : 10, "month" : "201803" } }

And let's query:

db.large_collection_precomputed.aggregate([
    // Initial gross filtering of dates (months) (on 200M documents):
    { $match: { "prec.month": { $gte: "201802", $lte: "201805" } } },
    { $match: { "dimensions.account_id": { $in: [
        ObjectId("590889944befcf34204dbf1f"), ObjectId("590889944befcf34204dbef2")
    ]} }},
    // Exact filtering of dates (costlier, but only on ~1.5M documents).
    { $match: { "dimensions.date": { $gte: new Date(1512082800000), $lte: new Date(1522447200000) } } },
    { $group: {
        // The _id is now extremely fast to retrieve:
        _id: { year: "$prec.year", "week": "$prec.week" },
        cost: { $sum: "$metrics.cost" },
        likes: { $sum: "$metrics.likes" }
    }},
    ...
])

In this case, we would use indexes on account_id and month.
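
One possible reading of this is a single compound index. The sketch below assumes that shape, with the $in field first and the range field second. The key order is my assumption, not something that was benchmarked here:

```javascript
// Assumed index shape: the field matched with $in (account_id) first,
// the field matched with a range (month) second.
const precomputedIndex = { "dimensions.account_id": 1, "prec.month": 1 };

// Only runs inside the mongo shell, where the global `db` is defined.
if (typeof db !== "undefined") {
  db.large_collection_precomputed.createIndex(precomputedIndex);
}
```

Checking the explain output afterwards would show whether the planner actually picks this index for the first two $match stages.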

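Note: here, months are stored as a String ("201803"), since I'm not sure how to cast them to Int within an aggregation query. It would be best to store them as Int when the records are inserted.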
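
For what it's worth, one way to get an integer month key without any string conversion is plain arithmetic on $year and $month, both of which accept a timezone. This is only a sketch: the variable names are made up for illustration and it hasn't been run against the real collection. On MongoDB 4.0 and later, wrapping the $dateToString expression in $toInt should be another option.

```javascript
const tz = "Europe/Amsterdam";

// year * 100 + month, e.g. 2018 * 100 + 3 = 201803, computed inside the pipeline.
const monthAsInt = {
  $add: [
    { $multiply: [{ $year: { date: "$dimensions.date", timezone: tz } }, 100] },
    { $month: { date: "$dimensions.date", timezone: tz } },
  ],
};

// Stage that could replace the "prec.month" entry of the $addFields stage above.
const addMonthInt = { $addFields: { "prec.month": monthAsInt } };

// Same arithmetic in plain JavaScript, for computing the value at insert time.
function monthKeyInt(year, month) {
  return year * 100 + month;
}

console.log(monthKeyInt(2018, 3)); // 201803
```

With integer keys, the coarse filter bounds would be numbers as well, for example { $gte: 201802, $lte: 201805 }.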

As a side effect, this will obviously make the collection heavier on disk and in RAM.

Aggregation pipeline slow with large collection
