Ultra slow query - sped up, but not perfect... please help

Time: 2010-12-17 17:00:03

Tags: sql join performance distinct

Yesterday I posted a query (see here) that performed terribly (it took a minute to run and returned 18,215 records):

SELECT DISTINCT 
    dbo.contacts_link_emails.Email, dbo.contacts.ContactID, dbo.contacts.First AS ContactFirstName, dbo.contacts.Last AS ContactLastName, dbo.contacts.InstitutionID, 
    dbo.institutionswithzipcodesadditional.CountyID, dbo.institutionswithzipcodesadditional.StateID,  dbo.institutionswithzipcodesadditional.DistrictID
FROM         
    dbo.contacts_def_jobfunctions AS contacts_def_jobfunctions_3 
INNER JOIN
    dbo.contacts 
INNER JOIN
    dbo.contacts_link_emails 
        ON dbo.contacts.ContactID = dbo.contacts_link_emails.ContactID 
        ON contacts_def_jobfunctions_3.JobID = dbo.contacts.JobTitle 
INNER JOIN
    dbo.institutionswithzipcodesadditional 
        ON dbo.contacts.InstitutionID = dbo.institutionswithzipcodesadditional.InstitutionID 
LEFT OUTER JOIN
    dbo.contacts_def_jobfunctions 
INNER JOIN
    dbo.contacts_link_jobfunctions 
        ON dbo.contacts_def_jobfunctions.JobID = dbo.contacts_link_jobfunctions.JobID 
        ON dbo.contacts.ContactID = dbo.contacts_link_jobfunctions.ContactID
WHERE     
        (dbo.contacts.JobTitle IN
        (SELECT     JobID
        FROM          dbo.contacts_def_jobfunctions AS contacts_def_jobfunctions_1
        WHERE      (ParentJobID <> '1841'))) 
    AND
        (dbo.contacts_link_emails.Email NOT IN
        (SELECT     EmailAddress
        FROM          dbo.newsletterremovelist)) 
OR
        (dbo.contacts_link_jobfunctions.JobID IN
        (SELECT     JobID
        FROM          dbo.contacts_def_jobfunctions AS contacts_def_jobfunctions_2
        WHERE      (ParentJobID <> '1841')))
    AND 
        (dbo.contacts_link_emails.Email NOT IN
        (SELECT     EmailAddress
        FROM          dbo.newsletterremovelist AS newsletterremovelist)) 
ORDER BY EMAIL

After a lot of guidance and research, I have reworked it into the following:

SELECT  contacts.ContactID,
        contacts.InstitutionID,
        contacts.First,
        contacts.Last,
        institutionswithzipcodesadditional.CountyID, 
        institutionswithzipcodesadditional.StateID,
        institutionswithzipcodesadditional.DistrictID
FROM    contacts 
    INNER JOIN contacts_link_emails ON 
    contacts.ContactID = contacts_link_emails.ContactID
    INNER JOIN institutionswithzipcodesadditional ON
    contacts.InstitutionID = institutionswithzipcodesadditional.InstitutionID
WHERE
    (contacts.ContactID IN
        (SELECT contacts_2.ContactID
        FROM contacts AS contacts_2
        INNER JOIN contacts_link_emails AS contacts_link_emails_2 ON
            contacts_2.ContactID = contacts_link_emails_2.ContactID
        LEFT OUTER JOIN contacts_def_jobfunctions ON 
            contacts_2.JobTitle = contacts_def_jobfunctions.JobID
        RIGHT OUTER JOIN newsletterremovelist ON 
            contacts_link_emails_2.Email = newsletterremovelist.EmailAddress
        WHERE (contacts_def_jobfunctions.ParentJobID <> 1841)
        GROUP BY contacts_2.ContactID
    UNION
        SELECT contacts_1.ContactID
        FROM contacts_link_jobfunctions
        INNER JOIN contacts_def_jobfunctions AS contacts_def_jobfunctions_1 ON
            contacts_link_jobfunctions.JobID = contacts_def_jobfunctions_1.JobID 
            AND contacts_def_jobfunctions_1.ParentJobID <> 1841 
        INNER JOIN contacts AS contacts_1 ON 
            contacts_link_jobfunctions.ContactID = contacts_1.ContactID
        INNER JOIN contacts_link_emails AS contacts_link_emails_1 ON
            contacts_link_emails_1.ContactID = contacts_1.ContactID
        LEFT OUTER JOIN newsletterremovelist AS newsletterremovelist_1 ON
        contacts_link_emails_1.Email = newsletterremovelist_1.EmailAddress
        GROUP BY contacts_1.ContactID))

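This query is now very fast (about 3 seconds), but I have broken part of the logic somewhere: it returns only 14,863 rows instead of the 18,215 rows that I believe are correct.

The results look close to correct. I am trying to work out which data might be missing from the result set.

Can you walk me through what I have done wrong here?

Thanks,

Russell Schutte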

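3 answers:

Answer 0 (score: 2)

The main problem with your original query is that it has two extra joins that only introduce duplicates, followed by a DISTINCT to remove them again.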
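To make the duplicate problem concrete, here is a minimal sketch. It runs against an in-memory SQLite database from Python rather than SQL Server, and the toy rows and column types are assumptions made up for illustration. It shows a join multiplying rows, DISTINCT collapsing them again, and EXISTS never producing them in the first place.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE contacts (ContactID INTEGER PRIMARY KEY, Last TEXT);
    CREATE TABLE contacts_link_jobfunctions (ContactID INTEGER, JobID INTEGER);
    INSERT INTO contacts VALUES (1, 'Adams'), (2, 'Baker'), (3, 'Clark');
    -- Contact 1 has three job functions, contact 2 has one, contact 3 has none.
    INSERT INTO contacts_link_jobfunctions VALUES (1, 10), (1, 11), (1, 12), (2, 10);
""")

# A join returns one row per matching job function, so contact 1 appears 3 times.
joined = conn.execute("""
    SELECT c.ContactID
    FROM contacts c
    JOIN contacts_link_jobfunctions clj ON clj.ContactID = c.ContactID
    ORDER BY c.ContactID
""").fetchall()

# DISTINCT has to collapse those duplicates after they were produced.
distinct = conn.execute("""
    SELECT DISTINCT c.ContactID
    FROM contacts c
    JOIN contacts_link_jobfunctions clj ON clj.ContactID = c.ContactID
    ORDER BY c.ContactID
""").fetchall()

# EXISTS only tests whether a match is present, so no duplicates arise.
exists = conn.execute("""
    SELECT c.ContactID
    FROM contacts c
    WHERE EXISTS (SELECT NULL
                  FROM contacts_link_jobfunctions clj
                  WHERE clj.ContactID = c.ContactID)
    ORDER BY c.ContactID
""").fetchall()

print(joined)
print(distinct)
print(exists)
```

On the toy data the DISTINCT and EXISTS variants return the same contacts, but only the join variants ever build the duplicate rows.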

Use this:

SELECT  cle.Email,
        c.ContactID,
        c.First AS ContactFirstName,
        c.Last AS ContactLastName,
        c.InstitutionID, 
        izip.CountyID,
        izip.StateID, 
        izip.DistrictID
FROM    dbo.contacts c
INNER JOIN
        dbo.institutionswithzipcodesadditional izip
ON      izip.InstitutionID = c.InstitutionID
INNER JOIN
        dbo.contacts_link_emails cle
ON      cle.ContactID = c.ContactID 
WHERE   cle.Email NOT IN
        (
        SELECT  EmailAddress
        FROM    dbo.newsletterremovelist
        )
        AND EXISTS
        (
        SELECT  NULL
        FROM    dbo.contacts_def_jobfunctions cdj
        WHERE   cdj.JobId = c.JobTitle
                AND cdj.ParentJobId <> '1841'
        UNION ALL
        SELECT  NULL
        FROM    dbo.contacts_link_jobfunctions clj
        JOIN    dbo.contacts_def_jobfunctions cdj
        ON      cdj.JobID = clj.JobID
        WHERE   clj.ContactID = c.ContactID
                AND cdj.ParentJobId <> '1841'
        )
ORDER BY
        email

Create the following indexes:

newsletterremovelist (EmailAddress)
contacts_link_jobfunctions (ContactID, JobID)
contacts_def_jobfunctions (JobID)
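As a rough sketch, the three indexes could be written as CREATE INDEX statements like the ones below. The index names (ix_...) and the column types are assumptions, and the example runs on in-memory SQLite from Python, so the query plan text will not look like a SQL Server execution plan. It only checks that a lookup by EmailAddress can use the new index.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE newsletterremovelist (EmailAddress TEXT);
    CREATE TABLE contacts_link_jobfunctions (ContactID INTEGER, JobID INTEGER);
    CREATE TABLE contacts_def_jobfunctions (JobID INTEGER, ParentJobID INTEGER);

    -- Index names are invented for this sketch.
    CREATE INDEX ix_removelist_email ON newsletterremovelist (EmailAddress);
    CREATE INDEX ix_linkjobs_contact_job ON contacts_link_jobfunctions (ContactID, JobID);
    CREATE INDEX ix_defjobs_job ON contacts_def_jobfunctions (JobID);
""")

# Ask SQLite how it would look up one address in the remove list.
plan = conn.execute(
    "EXPLAIN QUERY PLAN "
    "SELECT 1 FROM newsletterremovelist WHERE EmailAddress = ?",
    ("someone@example.com",),
).fetchall()

# The last column of each plan row is the human-readable detail text.
plan_text = " ".join(row[3] for row in plan)
print(plan_text)

index_names = sorted(
    name for (name,) in conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'index'"
    )
)
print(index_names)
```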

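Answer 1 (score: 0)

I'm not sure what the problem is, but when I run into this kind of situation, the first thing I do is start removing variables.

So, comment out the WHERE clause. How many rows are returned?

If you get back 11,604 rows, then you have isolated the problem to the joins. Work through the joins, commenting each one out (and removing the associated columns) to determine how many rows each one eliminates.

As you do this, the goal is to find what causes the desired rows to be eliminated. Once you have isolated it, consider how the joins differ between the first query and the second query.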
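The elimination procedure described above can be sketched as a loop that adds one join at a time and records the row count after each step. This is only an illustration on in-memory SQLite with invented rows; the aliases c, cle and izip are borrowed from the other answer, and the real tables have more columns.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE contacts (ContactID INTEGER, InstitutionID INTEGER);
    CREATE TABLE contacts_link_emails (ContactID INTEGER, Email TEXT);
    CREATE TABLE institutionswithzipcodesadditional (InstitutionID INTEGER, StateID INTEGER);
    INSERT INTO contacts VALUES (1, 100), (2, 100), (3, 999);
    INSERT INTO contacts_link_emails VALUES
        (1, 'a@example.com'), (2, 'b@example.com'), (3, 'c@example.com');
    -- Institution 999 is missing here, so contact 3 is dropped by the second join.
    INSERT INTO institutionswithzipcodesadditional VALUES (100, 5);
""")

joins = [
    "INNER JOIN contacts_link_emails cle ON cle.ContactID = c.ContactID",
    "INNER JOIN institutionswithzipcodesadditional izip "
    "ON izip.InstitutionID = c.InstitutionID",
]

# Count rows with no joins, then with the first join, then with both.
counts = []
for n in range(len(joins) + 1):
    sql = "SELECT COUNT(*) FROM contacts c " + " ".join(joins[:n])
    counts.append(conn.execute(sql).fetchone()[0])
    print(n, "join(s):", counts[-1])
```

The step at which the count drops points at the join that eliminates rows; in this toy data it is the join to institutionswithzipcodesadditional.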


Looking at the first query, you could probably modify it to eliminate the INs and use EXISTS instead.
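As a hedged illustration of swapping IN for EXISTS, the sketch below compares NOT IN with NOT EXISTS for the remove-list filter. It uses in-memory SQLite and made-up rows, and whether the real newsletterremovelist ever contains a NULL EmailAddress is not known from the question. The NULL row is inserted only to show a case where the two forms stop agreeing.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE contacts_link_emails (ContactID INTEGER, Email TEXT);
    CREATE TABLE newsletterremovelist (EmailAddress TEXT);
    INSERT INTO contacts_link_emails VALUES (1, 'a@example.com'), (2, 'b@example.com');
    INSERT INTO newsletterremovelist VALUES ('b@example.com');
""")

not_in_sql = """
    SELECT ContactID FROM contacts_link_emails cle
    WHERE cle.Email NOT IN (SELECT EmailAddress FROM newsletterremovelist)
"""
not_exists_sql = """
    SELECT ContactID FROM contacts_link_emails cle
    WHERE NOT EXISTS (SELECT NULL FROM newsletterremovelist nrl
                      WHERE nrl.EmailAddress = cle.Email)
"""

# Without NULLs in the list, both forms keep only contact 1.
before = (conn.execute(not_in_sql).fetchall(),
          conn.execute(not_exists_sql).fetchall())

# Hypothetical: a NULL address ends up in the remove list.
conn.execute("INSERT INTO newsletterremovelist VALUES (NULL)")
after = (conn.execute(not_in_sql).fetchall(),
         conn.execute(not_exists_sql).fetchall())

print(before)
print(after)
```

With a NULL in the list, NOT IN can no longer prove that any address is absent and returns no rows, while NOT EXISTS still returns contact 1.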

Consider your indexes too. Anything in the WHERE or JOIN clauses should probably be indexed.

Answer 2 (score: 0)

Do you get the same results when you run these two queries?

SELECT count(*)
FROM          
    dbo.contacts_def_jobfunctions AS contacts_def_jobfunctions_3  
INNER JOIN 
    dbo.contacts  
INNER JOIN 
    dbo.contacts_link_emails  
        ON dbo.contacts.ContactID = dbo.contacts_link_emails.ContactID  
        ON contacts_def_jobfunctions_3.JobID = dbo.contacts.JobTitle

SELECT COUNT(*)
FROM        
    contacts 
INNER JOIN contacts_link_jobfunctions 
    ON contacts.ContactID = contacts_link_jobfunctions.ContactID 
INNER JOIN  contacts_link_emails 
    ON contacts.ContactID = contacts_link_emails.ContactID 

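If so, keep adding each join condition until the results stop matching, and you will see where your mistake is. If all the joins match, then look at the WHERE clauses. But I would be surprised if it isn't in the first join, because the syntax you have there shouldn't even work on SQL Server; it is pretty nonstandard SQL and may have been wrong all along without anyone noticing.

Alternatively, pick a few records that are returned by the original query but not by the revised one. Trace them one at a time and see if you can find out why the second query filters them out.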
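One way to find such records is a set difference with EXCEPT, which SQL Server also supports. The sketch below runs on in-memory SQLite with invented rows, and original_sql and revised_sql are stand-ins, not the two queries from the question. The revised stand-in does reuse one pattern from the reworked query: a WHERE filter on a column of a LEFT OUTER JOINed table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE contacts (ContactID INTEGER, JobTitle INTEGER);
    CREATE TABLE contacts_def_jobfunctions (JobID INTEGER, ParentJobID INTEGER);
    INSERT INTO contacts VALUES (1, 10), (2, 20), (3, NULL);
    INSERT INTO contacts_def_jobfunctions VALUES (10, 5), (20, 1841);
""")

# Stand-ins for the two real queries; only the shape of the comparison matters.
original_sql = "SELECT ContactID FROM contacts"
revised_sql = """
    SELECT c.ContactID
    FROM contacts c
    LEFT OUTER JOIN contacts_def_jobfunctions cdj ON cdj.JobID = c.JobTitle
    WHERE cdj.ParentJobID <> 1841
"""

# Rows the original returns that the revised query does not.
missing = conn.execute(
    f"SELECT ContactID FROM ({original_sql}) "
    f"EXCEPT SELECT ContactID FROM ({revised_sql}) "
    "ORDER BY 1"
).fetchall()
print(missing)

# Trace each missing contact to see which join or filter removed it.
traces = []
for (contact_id,) in missing:
    row = conn.execute(
        "SELECT c.ContactID, c.JobTitle, cdj.ParentJobID "
        "FROM contacts c "
        "LEFT OUTER JOIN contacts_def_jobfunctions cdj ON cdj.JobID = c.JobTitle "
        "WHERE c.ContactID = ?",
        (contact_id,),
    ).fetchone()
    traces.append(row)
    print(row)
```

In this toy data, contact 3 has no matching job row, so ParentJobID is NULL and the comparison with 1841 is not true. The WHERE filter drops that row, which makes the outer join behave like an inner join.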
