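Optimizing a left-join query with multiple counts in SQLAlchemy?

Time: 2017-09-27 06:47:07

Tags: count sqlalchemy left-join query-optimization mariadb

I am trying to optimize a query that has multiple counts over objects in a subordinate table (using aliases in SQLAlchemy). In Witch Academia terms, something like this: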

SELECT
  exam.id                AS exam_id,
  exam.name              AS exam_name,
  count(tried_witch.id)  AS tried,
  count(passed_witch.id) AS passed,
  count(failed_witch.id) AS failed
FROM exam
  LEFT OUTER JOIN witch AS tried_witch
    ON tried_witch.exam_id = exam.id AND
       tried_witch.is_failed = 0 AND
       tried_witch.status != "passed"
  LEFT OUTER JOIN witch AS passed_witch
    ON passed_witch.exam_id = exam.id AND
       passed_witch.is_failed = 0 AND
       passed_witch.status = "passed"
  LEFT OUTER JOIN witch AS failed_witch
    ON failed_witch.exam_id = exam.id AND
       failed_witch.is_failed = 1
GROUP BY exam.id, exam.name
ORDER BY tried ASC
LIMIT 20

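The number of witches can be large (hundreds of thousands) while the number of exams is smaller (hundreds), so the query above is quite slow. In many similar questions I found answers proposing the approach above, but I feel a completely different approach is needed here, and I am stuck coming up with an alternative. Note that the result must be sorted by the calculated counts. It is also important, of course, to get zeros as counts. (Never mind the slightly odd model: witches can easily clone themselves to sit multiple exams, hence one identity per exam.)

With one more EXISTS subquery, which is not shown above and does not affect the result, the situation is: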

# Query_time: 1.135747  Lock_time: 0.000209  Rows_sent: 20  Rows_examined: 98174
# Rows_affected: 0
# Full_scan: Yes  Full_join: No  Tmp_table: Yes  Tmp_table_on_disk: Yes
# Filesort: Yes  Filesort_on_disk: No  Merge_passes: 0  Priority_queue: No

Updated query, which is still slow:

SELECT
  exam.id              AS exam_id,
  exam.name            AS exam_name,
  count(CASE WHEN (witch.status != "passed" AND witch.is_failed = 0)
    THEN witch.id
        ELSE NULL END) AS tried,
  count(CASE WHEN (witch.status = "passed" AND witch.is_failed = 0)
    THEN witch.id
        ELSE NULL END) AS passed,
  count(CASE WHEN (witch.is_failed = 1)
    THEN witch.id
        ELSE NULL END) AS failed
FROM exam
  LEFT OUTER JOIN witch ON witch.exam_id = exam.id
GROUP BY exam.id, exam.name
ORDER BY tried ASC
LIMIT 20
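
As a side check, the two queries above can be compared on toy data. The sketch below uses Python's built-in `sqlite3` as a stand-in for MariaDB; the sample rows and the `'pending'` status value are made up for illustration. It suggests that joining `witch` three times multiplies the matching rows per exam, so the counts from the first query can be inflated, while the single join with `CASE` visits each witch once.

```python
import sqlite3

# In-memory SQLite database standing in for MariaDB (an assumption for this sketch).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE exam (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE witch (
        id INTEGER PRIMARY KEY,
        exam_id INTEGER REFERENCES exam(id),
        is_failed INTEGER,
        status TEXT
    );
    INSERT INTO exam VALUES (1, 'broom'), (2, 'potions');
    INSERT INTO witch (exam_id, is_failed, status) VALUES
        (1, 0, 'pending'), (1, 0, 'pending'), (1, 0, 'passed'), (1, 1, 'pending');
""")
# Exam 1 has two witches still trying, one passed and one failed. Exam 2 has none.

three_joins = """
    SELECT exam.id, count(t.id), count(p.id), count(f.id)
    FROM exam
      LEFT JOIN witch AS t
        ON t.exam_id = exam.id AND t.is_failed = 0 AND t.status != 'passed'
      LEFT JOIN witch AS p
        ON p.exam_id = exam.id AND p.is_failed = 0 AND p.status = 'passed'
      LEFT JOIN witch AS f
        ON f.exam_id = exam.id AND f.is_failed = 1
    GROUP BY exam.id
    ORDER BY exam.id
"""

one_join = """
    SELECT exam.id,
           count(CASE WHEN witch.is_failed = 0 AND witch.status != 'passed'
                      THEN witch.id END),
           count(CASE WHEN witch.is_failed = 0 AND witch.status = 'passed'
                      THEN witch.id END),
           count(CASE WHEN witch.is_failed = 1 THEN witch.id END)
    FROM exam
      LEFT JOIN witch ON witch.exam_id = exam.id
    GROUP BY exam.id
    ORDER BY exam.id
"""

fanout_counts = conn.execute(three_joins).fetchall()
single_scan_counts = conn.execute(one_join).fetchall()
print(fanout_counts)       # rows are (exam_id, tried, passed, failed)
print(single_scan_counts)
```

On this data the three-join query sees 2 x 1 x 1 = 2 combined rows for exam 1, so every count comes out as 2, whereas the single-join query reports 2, 1 and 1. Both keep exam 2 with zero counts.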

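1 answer:

Answer 0: (score: 0)

Indexes are the key to query performance. I don't know MariaDB at all, so I am not sure what the possibilities are. But if it is anything like Microsoft SQL Server, this is what I would try: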

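  1. Create a composite index covering all the required columns: witch_id, status, and is_failed. If the query uses that index, that should be enough. The order of the columns in the index may matter a great deal. Then profile the query to see whether the index is actually used. See the Optimization and Indexes documentation page.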

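  2. Consider Generated (Virtual and Persistent) Columns. It looks like all the information needed to classify a witch as tried, passed, or failed is contained in the witch row itself. You can therefore create these as virtual columns directly on the database table and use the PERSISTENT option, which allows an index to be created on them. You can then create an index specifically for this query, containing witch_id and the three virtual columns: tried, passed, and failed. Make sure the query uses it, and that should be about as good as it gets. The query then becomes very simple: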

    SELECT      exam.id,
                exam.name,
                sum(witch.tried) AS tried,
                sum(witch.passed) AS passed,
                sum(witch.failed) AS failed
    FROM        exam
    INNER JOIN  witch ON exam.id = witch.exam_id
    GROUP BY    exam.id,
                exam.name 
    ORDER BY    sum(witch.tried)
    LIMIT       20
    
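  3. Although the classification only involves simple comparisons and AND / OR clauses, you are essentially offloading the computation of the three statuses to the database at INSERT / UPDATE time. Your query should then be faster at SELECT time.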
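
A minimal sketch of the generated-column idea, run against Python's built-in `sqlite3` as a stand-in for MariaDB (it assumes SQLite 3.31 or newer, which introduced generated columns). SQLite spells the option `STORED`; in MariaDB the analogous declaration would use `AS (...) PERSISTENT`. The index name, the sample rows, the `'pending'` status, and the `n_` aliases are hypothetical. The sketch also uses a `LEFT JOIN` with `COALESCE`, which is a deviation from the query above, so that exams without witches keep a zero count as the question requires.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE exam (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE witch (
        id INTEGER PRIMARY KEY,
        exam_id INTEGER REFERENCES exam(id),
        is_failed INTEGER,
        status TEXT,
        -- Computed once at INSERT / UPDATE time and stored with the row.
        tried  INTEGER GENERATED ALWAYS AS (is_failed = 0 AND status != 'passed') STORED,
        passed INTEGER GENERATED ALWAYS AS (is_failed = 0 AND status = 'passed') STORED,
        failed INTEGER GENERATED ALWAYS AS (is_failed = 1) STORED
    );
    -- Hypothetical index: join key first, then the three stored flags.
    CREATE INDEX ix_witch_exam_flags ON witch (exam_id, tried, passed, failed);
    INSERT INTO exam VALUES (1, 'broom'), (2, 'potions');
    INSERT INTO witch (exam_id, is_failed, status) VALUES
        (1, 0, 'pending'), (1, 0, 'passed'), (1, 1, 'pending');
""")

rows = conn.execute("""
    SELECT exam.id, exam.name,
           COALESCE(sum(witch.tried), 0)  AS n_tried,
           COALESCE(sum(witch.passed), 0) AS n_passed,
           COALESCE(sum(witch.failed), 0) AS n_failed
    FROM exam
      LEFT JOIN witch ON witch.exam_id = exam.id
    GROUP BY exam.id, exam.name
    ORDER BY n_tried, exam.id
    LIMIT 20
""").fetchall()

for row in rows:
    print(row)  # (exam_id, exam_name, tried, passed, failed)
```

`COALESCE` is needed here because `SUM` over the single NULL-extended row of an exam without witches yields NULL rather than 0.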

    Your example does not specify any result filtering (a WHERE clause), but if you have one, it may affect how the indexes should be tuned for query performance.
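
The composite index from step 1 could be tried out roughly as follows. This is only a sketch on Python's built-in `sqlite3`, not MariaDB: the index name `ix_witch_exam_state` and the column order (`exam_id` first, because it is the join and grouping key in the queries above) are assumptions, and the MariaDB counterpart of `EXPLAIN QUERY PLAN` would be `EXPLAIN`. The plan text differs between engines and versions, so it is printed rather than relied upon.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE exam (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE witch (
        id INTEGER PRIMARY KEY,
        exam_id INTEGER REFERENCES exam(id),
        is_failed INTEGER,
        status TEXT
    );
    -- Hypothetical composite index covering the columns the query reads.
    CREATE INDEX ix_witch_exam_state ON witch (exam_id, is_failed, status);
""")

query = """
    SELECT exam.id, exam.name,
           sum(CASE WHEN witch.is_failed = 0 AND witch.status != 'passed'
                    THEN 1 ELSE 0 END) AS tried
    FROM exam
      LEFT JOIN witch ON witch.exam_id = exam.id
    GROUP BY exam.id, exam.name
    ORDER BY tried
    LIMIT 20
"""

# Columns of the index, in index order (third field of each PRAGMA row).
index_columns = [
    row[2] for row in conn.execute("PRAGMA index_info('ix_witch_exam_state')")
]

# Human-readable plan steps (last field of each EXPLAIN QUERY PLAN row).
plan = [row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + query)]

print(index_columns)
for step in plan:
    print(step)  # check here whether the index is mentioned
```

If a WHERE clause is added later, the filtered columns may need to come earlier in the index, so it is worth re-checking the plan after each change.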

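    Original answer: below is the originally suggested change to the query. Here I assume the index part of the optimization has already been done.

    Could you try using SUM instead of COUNT?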

    SELECT exam.id,
           exam.name,
           sum(CASE
                   WHEN (witch.is_failed = 0
                         AND witch.status != 'passed') THEN 1
                   ELSE 0
               END) AS tried,
           sum(CASE
                   WHEN (witch.is_failed = 0
                         AND witch.status = 'passed') THEN 1
                   ELSE 0
               END) AS passed,
           sum(CASE
                   WHEN (witch.is_failed = 1) THEN 1
                   ELSE 0
               END) AS failed
    FROM exam
    INNER JOIN witch ON exam.id = witch.exam_id
    GROUP BY exam.id,
             exam.name
    ORDER BY sum(CASE
                     WHEN (witch.is_failed = 0
                           AND witch.status != 'passed') THEN 1
                     ELSE 0
                 END)
    LIMIT 20
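
One thing to watch with the query above: the question asks for exams with zero counts, and an `INNER JOIN` only returns exams that have at least one witch. The sketch below, again using Python's built-in `sqlite3` with made-up rows and a hypothetical `'pending'` status, runs the same SUM query with both join types to show the difference.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE exam (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE witch (
        id INTEGER PRIMARY KEY,
        exam_id INTEGER REFERENCES exam(id),
        is_failed INTEGER,
        status TEXT
    );
    INSERT INTO exam VALUES (1, 'broom'), (2, 'potions');
    INSERT INTO witch (exam_id, is_failed, status) VALUES
        (1, 0, 'pending'), (1, 0, 'passed'), (1, 1, 'pending');
""")

# Same conditional sums as above; only the join type is swapped in.
SUMS = """
    SELECT exam.id,
           sum(CASE WHEN witch.is_failed = 0 AND witch.status != 'passed'
                    THEN 1 ELSE 0 END) AS tried,
           sum(CASE WHEN witch.is_failed = 0 AND witch.status = 'passed'
                    THEN 1 ELSE 0 END) AS passed,
           sum(CASE WHEN witch.is_failed = 1 THEN 1 ELSE 0 END) AS failed
    FROM exam
      {join} witch ON exam.id = witch.exam_id
    GROUP BY exam.id
    ORDER BY tried, exam.id
    LIMIT 20
"""

inner_rows = conn.execute(SUMS.format(join="INNER JOIN")).fetchall()
left_rows = conn.execute(SUMS.format(join="LEFT OUTER JOIN")).fetchall()

print(inner_rows)  # exam 2 (no witches) is missing
print(left_rows)   # exam 2 is kept, with zeros
```

With the outer join, the `ELSE 0` branch turns the NULL-extended row of an empty exam into zeros, so no extra `COALESCE` is needed for this form of the query.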
    

    The rest: given that you specified sqlalchemy in your question, here is the sqlalchemy code I used to model this and generate the query:

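    # model (imports assume SQLAlchemy 1.4 or newer)
    from sqlalchemy import Column, ForeignKey, Integer, String, and_, case, func
    from sqlalchemy.ext.hybrid import hybrid_property
    from sqlalchemy.orm import declarative_base, relationship

    Base = declarative_base()


    class Exam(Base):
        __tablename__ = 'exam'  # must match ForeignKey('exam.id') below

        id = Column(Integer, primary_key=True)
        name = Column(String)
    
    
    class Witch(Base):
        __tablename__ = 'witch'

        id = Column(Integer, primary_key=True)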
    
        exam_id = Column(Integer, ForeignKey('exam.id'))
        is_failed = Column(Integer)
        status = Column(String)
    
        exam = relationship(Exam, backref='witches')
    
    
        # computed fields
        @hybrid_property
        def tried(self):
            return self.is_failed == 0 and self.status != 'passed'
    
        @hybrid_property
        def passed(self):
            return self.is_failed == 0 and self.status == 'passed'
    
        @hybrid_property
        def failed(self):
            return self.is_failed == 1
    
    
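        # computed fields: SQL expressions used at class level (in queries).
        # Each method reuses its hybrid's name so that the expression is
        # attached to the same attribute (required on SQLAlchemy 1.2+).
        # case() takes positional (condition, value) tuples, as in 1.4+.
        @tried.expression
        def tried(cls):
            return case((and_(
                cls.is_failed == 0,
                cls.status != 'passed',
            ), 1), else_=0)
    
        @passed.expression
        def passed(cls):
            return case((and_(
                cls.status == 'passed',
                cls.is_failed == 0,
            ), 1), else_=0)
    
        @failed.expression
        def failed(cls):
            return case((cls.is_failed == 1, 1), else_=0)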
    

    # query
    q = (
        session.query(
            Exam.id, Exam.name,
            func.sum(Witch.tried).label("tried"),
            func.sum(Witch.passed).label("passed"),
            func.sum(Witch.failed).label("failed"),
        )
        .join(Witch)
        .group_by(Exam.id, Exam.name)
        .order_by(func.sum(Witch.tried))
        .limit(20)
    )
    