Question

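I have about 60 million records in db['TF'].

I need to get the number of records.

If I run db['TF'].count(), it returns immediately.

If I run db['TF'].count_documents({}), it takes a very long time before I get the result.

However, the count method is going to be deprecated.

So, how can I get the count quickly when using count_documents? Am I missing some argument?

I have read the docs and the code, but found nothing.

Thanks a lot!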

Answer 1

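This is not about pymongo but about mongo itself.

count is a native mongo function. It does not actually count all the documents.

Whenever you insert or delete a record, mongo updates a cached total for the collection. When you run count on the collection, it simply returns that cached value.

count_documents takes a query object, which means it has to loop through all the records in order to count them. Because you pass an empty filter, it has to go over all 60 million records, which is why it is slow.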
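
The difference between the two approaches can be illustrated with a toy model. This is only a sketch: ToyCollection and its methods are hypothetical names invented for illustration, and it is not how MongoDB is implemented internally. It just contrasts reading a counter maintained on every write with scanning every document.

```python
class ToyCollection:
    """Hypothetical in-memory stand-in for a collection (illustration only)."""

    def __init__(self):
        self._docs = []
        self._cached_count = 0  # metadata counter, updated on every write

    def insert_one(self, doc):
        self._docs.append(doc)
        self._cached_count += 1

    def delete_one(self, predicate):
        # Remove the first document matching the predicate, if any.
        for i, doc in enumerate(self._docs):
            if predicate(doc):
                del self._docs[i]
                self._cached_count -= 1
                return True
        return False

    def estimated_count(self):
        # O(1): reads the counter, never touches the documents.
        return self._cached_count

    def count_matching(self, predicate=lambda doc: True):
        # O(n): every document is examined, even when the predicate
        # accepts everything (the analogue of an empty filter).
        return sum(1 for doc in self._docs if predicate(doc))
```

With 60 million documents, the second method does 60 million predicate checks, while the first does a single attribute read.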

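Based on @Stennie's comment:

You can use estimated_document_count() in PyMongo 3.7+ to return a fast count based on collection metadata. The original count() was deprecated because its behavior differed (estimated vs. actual count) depending on whether query criteria were provided. The newer driver API is more explicit about which result you get.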
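
A minimal sketch of how the two calls could be combined, assuming PyMongo 3.7+ and a collection object such as db['TF']. The helper name fast_count is hypothetical; only estimated_document_count() and count_documents() are PyMongo methods.

```python
def fast_count(collection, query=None):
    """Return a metadata-based estimate when there is no filter,
    and an exact (but slower) count when a filter is given.

    `collection` is expected to be a pymongo Collection (PyMongo 3.7+).
    """
    if not query:
        # No filter: read the count from collection metadata.
        # This is an estimate and may be inaccurate in some situations.
        return collection.estimated_document_count()
    # Filter given: the server has to evaluate the query.
    return collection.count_documents(query)
```

For the question above, fast_count(db['TF']) would take the fast path, while fast_count(db['TF'], {'state': 'x'}) (a made-up filter) would run the exact count.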

Answer 2

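As already mentioned here, this behavior is not specific to PyMongo.

The reason is that the count_documents method in PyMongo performs an aggregation query and does not use any metadata. See collection.py#L1670-L1688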


The aggregate command that count_documents sends has the same behavior as the collection.countDocuments method.

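That said, if you are willing to trade accuracy for performance, you can use the estimated_document_count method, which instead sends a count command to the database and has the same behavior as collection.estimatedDocumentCount.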

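# Excerpt from PyMongo's Collection.count_documents (collection.py#L1670-L1688).
# Build an aggregation pipeline: match the filter, then apply skip/limit.
pipeline = [{'$match': filter}]
if 'skip' in kwargs:
    pipeline.append({'$skip': kwargs.pop('skip')})
if 'limit' in kwargs:
    pipeline.append({'$limit': kwargs.pop('limit')})
# Count the remaining documents by summing 1 per document.
pipeline.append({'$group': {'_id': None, 'n': {'$sum': 1}}})
cmd = SON([('aggregate', self.__name),
           ('pipeline', pipeline),
           ('cursor', {})])
# A hint that is not a plain index name is converted to an index document.
if "hint" in kwargs and not isinstance(kwargs["hint"], string_type):
    kwargs["hint"] = helpers._index_document(kwargs["hint"])
collation = validate_collation_or_none(kwargs.pop('collation', None))
cmd.update(kwargs)
# Run the aggregate command on the server; no collection metadata is used.
with self._socket_for_reads(session) as (sock_info, slave_ok):
    result = self._aggregate_one_result(
        sock_info, slave_ok, cmd, collation, session)
# No result document means nothing matched the filter.
if not result:
    return 0
return result['n']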
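
The same pipeline can be reproduced outside the driver to see what the server is asked to do. The following is a rough sketch, not PyMongo's own code: build_count_pipeline and count_via_aggregate are hypothetical helper names, and the only driver call used is Collection.aggregate().

```python
def build_count_pipeline(filter, skip=None, limit=None):
    """Build the $match / $skip / $limit / $group stages used for counting."""
    pipeline = [{'$match': filter}]
    if skip is not None:
        pipeline.append({'$skip': skip})
    if limit is not None:
        pipeline.append({'$limit': limit})
    # Sum 1 for every document that reaches this stage.
    pipeline.append({'$group': {'_id': None, 'n': {'$sum': 1}}})
    return pipeline


def count_via_aggregate(collection, filter, **kwargs):
    """Run the pipeline on a pymongo Collection and return the count."""
    results = list(collection.aggregate(build_count_pipeline(filter, **kwargs)))
    # $group emits no document at all when nothing matched.
    return results[0]['n'] if results else 0
```

With an empty filter, the $match stage lets every document through, so the $group stage has to process the whole collection. That is the work the metadata-based count avoids.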

For estimated_document_count, see collection.py#L1609-L1614, where a helper sends the count command.

Why is PyMongo count_documents slower than count?
