Performing an aggregate function on a multi-million row table

Date: 2010-05-12 16:26:55

Tags: sql tsql sql-server-2008 aggregate large-data-volumes

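I'm having some serious performance issues with a multi-million row table that I feel I should be able to get results from fairly quickly. Here's a rundown of what I have, how I'm querying it, and how long it takes: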

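  • I'm running SQL Server 2008 Standard, so partitioning isn't currently an option.

  • I'm trying to aggregate all views for all inventory belonging to a specific account over the last 30 days.

  • All views are stored in the following table: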

CREATE TABLE [dbo].[LogInvSearches_Daily](
    [ID] [bigint] IDENTITY(1,1) NOT NULL,
    [Inv_ID] [int] NOT NULL,
    [Site_ID] [int] NOT NULL,
    [LogCount] [int] NOT NULL,
    [LogDay] [smalldatetime] NOT NULL,
 CONSTRAINT [PK_LogInvSearches_Daily] PRIMARY KEY CLUSTERED 
(
    [ID] ASC
)WITH (PAD_INDEX  = OFF, STATISTICS_NORECOMPUTE  = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS  = ON, ALLOW_PAGE_LOCKS  = ON, FILLFACTOR = 90) ON [PRIMARY]
) ON [PRIMARY]
  • This table has 132,000,000 records and is over 4 GB in size.

  • A sample of 10 rows from the table:

ID                   Inv_ID      Site_ID     LogCount    LogDay
-------------------- ----------- ----------- ----------- -----------------------
1                    486752      48          14          2009-07-21 00:00:00
2                    119314      51          16          2009-07-21 00:00:00
3                    313678      48          25          2009-07-21 00:00:00
4                    298863      0           1           2009-07-21 00:00:00
5                    119996      0           2           2009-07-21 00:00:00
6                    463777      534         7           2009-07-21 00:00:00
7                    339976      503         2           2009-07-21 00:00:00
8                    333501      570         4           2009-07-21 00:00:00
9                    453955      0           12          2009-07-21 00:00:00
10                   443291      0           4           2009-07-21 00:00:00

(10 row(s) affected)
  • I have the following index on LogInvSearches_Daily:
/****** Object:  Index [IX_LogInvSearches_Daily_LogDay]    Script Date: 05/12/2010 11:08:22 ******/
CREATE NONCLUSTERED INDEX [IX_LogInvSearches_Daily_LogDay] ON [dbo].[LogInvSearches_Daily] 
(
    [LogDay] ASC
)
INCLUDE ( [Inv_ID],
[LogCount]) WITH (PAD_INDEX  = OFF, STATISTICS_NORECOMPUTE  = OFF, SORT_IN_TEMPDB = OFF, IGNORE_DUP_KEY = OFF, DROP_EXISTING = OFF, ONLINE = OFF, ALLOW_ROW_LOCKS  = ON, ALLOW_PAGE_LOCKS  = ON) ON [PRIMARY]
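  • I only need to pull in inventory from the Inventory table for a specific account ID. I have an index on Inventory as well.

I'm using the following query to aggregate the data and give me the top 5 records. This query currently takes 24 seconds to return the 5 rows: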

StmtText
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
SELECT TOP 5
    Sum(LogCount) AS Views
    , DENSE_RANK() OVER(ORDER BY Sum(LogCount) DESC, Inv_ID DESC) AS Rank
    , Inv_ID
FROM LogInvSearches_Daily D (NOLOCK)
WHERE 
    LogDay > DateAdd(d, -30, getdate())
    AND EXISTS(
        SELECT NULL FROM propertyControlCenter.dbo.Inventory (NOLOCK) WHERE Acct_ID = 18731 AND Inv_ID = D.Inv_ID
    )
GROUP BY Inv_ID


(1 row(s) affected)

StmtText
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
  |--Top(TOP EXPRESSION:((5)))
       |--Sequence Project(DEFINE:([Expr1007]=dense_rank))
            |--Segment
                 |--Segment
                      |--Sort(ORDER BY:([Expr1006] DESC, [D].[Inv_ID] DESC))
                           |--Stream Aggregate(GROUP BY:([D].[Inv_ID]) DEFINE:([Expr1006]=SUM([LOALogs].[dbo].[LogInvSearches_Daily].[LogCount] as [D].[LogCount])))
                                |--Sort(ORDER BY:([D].[Inv_ID] ASC))
                                     |--Nested Loops(Inner Join, OUTER REFERENCES:([D].[Inv_ID]))
                                          |--Nested Loops(Inner Join, OUTER REFERENCES:([Expr1011], [Expr1012], [Expr1010]))
                                          |    |--Compute Scalar(DEFINE:(([Expr1011],[Expr1012],[Expr1010])=GetRangeWithMismatchedTypes(dateadd(day,(-30),getdate()),NULL,(6))))
                                          |    |    |--Constant Scan
                                          |    |--Index Seek(OBJECT:([LOALogs].[dbo].[LogInvSearches_Daily].[IX_LogInvSearches_Daily_LogDay] AS [D]), SEEK:([D].[LogDay] > [Expr1011] AND [D].[LogDay] < [Expr1012]) ORDERED FORWARD)
                                          |--Index Seek(OBJECT:([propertyControlCenter].[dbo].[Inventory].[IX_Inventory_Acct_ID]), SEEK:([propertyControlCenter].[dbo].[Inventory].[Acct_ID]=(18731) AND [propertyControlCenter].[dbo].[Inventory].[Inv_ID]=[LOA

(13 row(s) affected)

I tried using a CTE to pick up the rows first and then aggregate them, but that ran no faster and gives me essentially the same execution plan.


(1 row(s) affected)
StmtText
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
--SET SHOWPLAN_TEXT ON;
WITH getSearches AS (
        SELECT
            LogCount
--          , DENSE_RANK() OVER(ORDER BY Sum(LogCount) DESC, Inv_ID DESC) AS Rank
            , D.Inv_ID
        FROM LogInvSearches_Daily D (NOLOCK)
            INNER JOIN propertyControlCenter.dbo.Inventory I (NOLOCK) ON Acct_ID = 18731 AND I.Inv_ID = D.Inv_ID
        WHERE 
            LogDay > DateAdd(d, -30, getdate())
--      GROUP BY Inv_ID
)

SELECT Sum(LogCount) AS Views, Inv_ID
FROM getSearches
GROUP BY Inv_ID


(1 row(s) affected)

StmtText
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
  |--Stream Aggregate(GROUP BY:([D].[Inv_ID]) DEFINE:([Expr1004]=SUM([LOALogs].[dbo].[LogInvSearches_Daily].[LogCount] as [D].[LogCount])))
       |--Sort(ORDER BY:([D].[Inv_ID] ASC))
            |--Nested Loops(Inner Join, OUTER REFERENCES:([D].[Inv_ID]))
                 |--Nested Loops(Inner Join, OUTER REFERENCES:([Expr1008], [Expr1009], [Expr1007]))
                 |    |--Compute Scalar(DEFINE:(([Expr1008],[Expr1009],[Expr1007])=GetRangeWithMismatchedTypes(dateadd(day,(-30),getdate()),NULL,(6))))
                 |    |    |--Constant Scan
                 |    |--Index Seek(OBJECT:([LOALogs].[dbo].[LogInvSearches_Daily].[IX_LogInvSearches_Daily_LogDay] AS [D]), SEEK:([D].[LogDay] > [Expr1008] AND [D].[LogDay] < [Expr1009]) ORDERED FORWARD)
                 |--Index Seek(OBJECT:([propertyControlCenter].[dbo].[Inventory].[IX_Inventory_Acct_ID] AS [I]), SEEK:([I].[Acct_ID]=(18731) AND [I].[Inv_ID]=[LOALogs].[dbo].[LogInvSearches_Daily].[Inv_ID] as [D].[Inv_ID]) ORDERED FORWARD)

(8 row(s) affected)


(1 row(s) affected)

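So given that I'm getting good Index Seeks in my execution plans, what can I do to get this running faster?

UPDATE:

Here's the same query run without DENSE_RANK(). It runs in exactly the same 24 seconds and gives me the same basic query plan: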

StmtText
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
--SET SHOWPLAN_TEXT ON
SELECT TOP 5
    Sum(LogCount) AS Views
    , Inv_ID
FROM LogInvSearches_Daily D (NOLOCK)
WHERE 
    LogDay > DateAdd(d, -30, getdate())
    AND EXISTS(
        SELECT NULL FROM propertyControlCenter.dbo.Inventory (NOLOCK) WHERE Acct_ID = 18731 AND Inv_ID = D.Inv_ID
    )
GROUP BY Inv_ID
ORDER BY Views, Inv_ID
(1 row(s) affected)

StmtText
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
  |--Sort(TOP 5, ORDER BY:([Expr1006] ASC, [D].[Inv_ID] ASC))
       |--Stream Aggregate(GROUP BY:([D].[Inv_ID]) DEFINE:([Expr1006]=SUM([LOALogs].[dbo].[LogInvSearches_Daily].[LogCount] as [D].[LogCount])))
            |--Sort(ORDER BY:([D].[Inv_ID] ASC))
                 |--Nested Loops(Inner Join, OUTER REFERENCES:([D].[Inv_ID]))
                      |--Nested Loops(Inner Join, OUTER REFERENCES:([Expr1010], [Expr1011], [Expr1009]))
                      |    |--Compute Scalar(DEFINE:(([Expr1010],[Expr1011],[Expr1009])=GetRangeWithMismatchedTypes(dateadd(day,(-30),getdate()),NULL,(6))))
                      |    |    |--Constant Scan
                      |    |--Index Seek(OBJECT:([LOALogs].[dbo].[LogInvSearches_Daily].[IX_LogInvSearches_Daily_LogDay] AS [D]), SEEK:([D].[LogDay] > [Expr1010] AND [D].[LogDay] < [Expr1011]) ORDERED FORWARD)
                      |--Index Seek(OBJECT:([propertyControlCenter].[dbo].[Inventory].[IX_Inventory_Acct_ID]), SEEK:([propertyControlCenter].[dbo].[Inventory].[Acct_ID]=(18731) AND [propertyControlCenter].[dbo].[Inventory].[Inv_ID]=[LOALogs].[dbo].[LogInvS

(9 row(s) affected)


Thanks,

3 Answers:

Answer 0: (score: 1)

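I haven't read your whole question yet (I'll get to it shortly), but to answer an early comment: you can use partitioned views in SQL Server 2008 Standard Edition. It's table partitioning (which is admittedly more flexible) that is limited to Enterprise Edition.

Partitioned view info: http://msdn.microsoft.com/en-us/library/ms190019.aspx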
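
As a rough illustration of what a partitioned view over this data could look like, here is a minimal sketch. The monthly member tables, their names, the constraint names, and the view name are all hypothetical; only the column list comes from the question. The CHECK constraint on LogDay is what lets the optimizer skip member tables that cannot contain rows for the requested date range.

```sql
-- Hypothetical member table for April 2010 (names are assumptions, not from the question).
CREATE TABLE dbo.LogInvSearches_Daily_2010_04 (
    ID       bigint        NOT NULL,
    Inv_ID   int           NOT NULL,
    Site_ID  int           NOT NULL,
    LogCount int           NOT NULL,
    LogDay   smalldatetime NOT NULL
        CONSTRAINT CK_LogInvSearches_Daily_2010_04_LogDay
        CHECK (LogDay >= '20100401' AND LogDay < '20100501'),
    CONSTRAINT PK_LogInvSearches_Daily_2010_04 PRIMARY KEY CLUSTERED (LogDay, ID)
);

-- Hypothetical member table for May 2010, same shape with a different date range.
CREATE TABLE dbo.LogInvSearches_Daily_2010_05 (
    ID       bigint        NOT NULL,
    Inv_ID   int           NOT NULL,
    Site_ID  int           NOT NULL,
    LogCount int           NOT NULL,
    LogDay   smalldatetime NOT NULL
        CONSTRAINT CK_LogInvSearches_Daily_2010_05_LogDay
        CHECK (LogDay >= '20100501' AND LogDay < '20100601'),
    CONSTRAINT PK_LogInvSearches_Daily_2010_05 PRIMARY KEY CLUSTERED (LogDay, ID)
);
GO

-- The view unions the member tables; queries filter on LogDay as before.
CREATE VIEW dbo.LogInvSearches_Daily_All
AS
SELECT ID, Inv_ID, Site_ID, LogCount, LogDay FROM dbo.LogInvSearches_Daily_2010_04
UNION ALL
SELECT ID, Inv_ID, Site_ID, LogCount, LogDay FROM dbo.LogInvSearches_Daily_2010_05;
GO
```

The existing query would then read from dbo.LogInvSearches_Daily_All instead of the single large table. Whether this helps in practice depends on how the data is loaded and how many member tables a 30-day window touches.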

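On the wider question, I wonder whether you really need DENSE_RANK at all. I suspect you're confusing the ORDER BY inside DENSE_RANK with the ORDER BY of the query itself. At the moment, your TOP 5 returns 5 undefined records, because SQL Server does not guarantee any ordering of records unless an ORDER BY clause is specified on the query (which you haven't done). If you move the ORDER BY from DENSE_RANK down to the query as a whole, as follows, the records will come back in the order I expect you want, and it removes the need for the expensive DENSE_RANK ranking function.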

SELECT TOP 5
    SUM([LogCount]) AS [Views],
    [Inv_ID]
FROM [LogInvSearches_Daily] D (NOLOCK)
WHERE 
    [LogDay] > DateAdd(d, -30, getdate())
    AND EXISTS(
        SELECT *
        FROM Inventory (NOLOCK)
        WHERE Acct_ID = 18731
            AND Inv_ID = D.Inv_ID
    )
GROUP BY
    Inv_ID
ORDER BY
    [Views] DESC,
    [Inv_ID]

Update:

The time is probably being spent here:

|--Sort(ORDER BY:([D].[Inv_ID] ASC))
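
One way to check that suspicion rather than guess is to capture I/O and timing figures around the query. This is only a sketch of the measurement, reusing the query from the question; it reports reads per table and total CPU and elapsed time, not a per-operator breakdown, so the actual execution plan is still needed to see how many rows reach the Sort.

```sql
-- Report logical/physical reads per table and CPU/elapsed time for each statement.
SET STATISTICS IO ON;
SET STATISTICS TIME ON;

SELECT TOP 5
    SUM(D.LogCount) AS Views,
    D.Inv_ID
FROM LogInvSearches_Daily D (NOLOCK)
WHERE
    D.LogDay > DATEADD(d, -30, GETDATE())
    AND EXISTS (
        SELECT NULL
        FROM propertyControlCenter.dbo.Inventory I (NOLOCK)
        WHERE I.Acct_ID = 18731
            AND I.Inv_ID = D.Inv_ID
    )
GROUP BY D.Inv_ID
ORDER BY Views DESC, D.Inv_ID;

SET STATISTICS TIME OFF;
SET STATISTICS IO OFF;
```

If the logical reads on LogInvSearches_Daily are very high, the cost is more likely in reading 30 days of rows for every account than in the Sort itself.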

You could try creating a covering index like this:

CREATE NONCLUSTERED INDEX [IX_LogInvSearches_Daily_Perf] ON [dbo].[LogInvSearches_Daily] 
(
    [Inv_ID] ASC,
    [LogDay] ASC
)
INCLUDE
(
    [LogCount]
)

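Note that I also changed the ORDER BY slightly (Inv_ID is now sorted ASC instead of DESC). I suspect this change won't affect the results in a problematic way, but it might help performance, because rows will be returned in the same order in which they are grouped (though this may be irrelevant!).

Answer 1: (score: 1)

Partitioning aside,

In our experience with tables much bigger than yours, we extract the data into a temp table (not a table variable) and aggregate on that. We don't do this for every query, only for the more complex ones.

Other than that, I agree with Daniel Renshaw's observation about DENSE_RANK.

I'd also consider moving [Inv_ID], [LogCount] into the index key (rather than INCLUDE, possibly with DESC ordering).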
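
A minimal sketch of that temp-table approach, applied to the query from the question, might look like the following. The temp table name #Searches is an assumption, and the filter is copied from the original query; this is one way to split the work, not necessarily how the answerer does it.

```sql
-- Step 1: extract only the rows of interest into a temp table.
SELECT D.Inv_ID, D.LogCount
INTO #Searches
FROM dbo.LogInvSearches_Daily AS D WITH (NOLOCK)
WHERE D.LogDay > DATEADD(day, -30, GETDATE())
    AND EXISTS (
        SELECT 1
        FROM propertyControlCenter.dbo.Inventory AS I WITH (NOLOCK)
        WHERE I.Acct_ID = 18731
            AND I.Inv_ID = D.Inv_ID
    );

-- Step 2: aggregate the much smaller temp table.
SELECT TOP (5)
    Inv_ID,
    SUM(LogCount) AS Views
FROM #Searches
GROUP BY Inv_ID
ORDER BY Views DESC, Inv_ID;

DROP TABLE #Searches;
```

Unlike a table variable, a temp table gets statistics, which can give the optimizer a better row estimate for the aggregation step.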
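
Read literally, that suggestion could be sketched as the index below. The index name is hypothetical, and the choice of key order and of DESC on LogCount is an assumption about what was meant; it would need testing against the existing IX_LogInvSearches_Daily_LogDay index.

```sql
-- Inv_ID and LogCount become key columns instead of INCLUDE columns.
CREATE NONCLUSTERED INDEX IX_LogInvSearches_Daily_LogDay_Inv_LogCount
ON dbo.LogInvSearches_Daily
(
    LogDay   ASC,
    Inv_ID   ASC,
    LogCount DESC
);
```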

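Answer 2: (score: 0)

Acct_ID is on the Inventory table, which appears to have its own index (IX_Inventory_Acct_ID). Perhaps you would have more luck if there were an index on Inventory (Acct_ID, Inv_ID) and LogInvSearches_Daily were clustered (or at least indexed) on (Inv_ID, LogDay).

By the way, I don't see what the current clustered index on LogInvSearches_Daily.ID is supposed to buy you. Why is it important to have records with close IDs close together on disk?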
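
A sketch of those two changes follows. The new index names are hypothetical. Re-clustering means dropping and re-adding the primary key, which rebuilds the table and would be expensive on 132 million rows, so treat this as an outline to try on a copy first. The plans in the question already show a seek on both Acct_ID and Inv_ID in IX_Inventory_Acct_ID, so the Inventory index may turn out to be redundant.

```sql
-- In the propertyControlCenter database: composite index on Inventory.
CREATE NONCLUSTERED INDEX IX_Inventory_Acct_ID_Inv_ID
ON dbo.Inventory (Acct_ID, Inv_ID);

-- In the log database: re-cluster LogInvSearches_Daily on (Inv_ID, LogDay).
-- The primary key on ID is kept, but as a nonclustered constraint.
ALTER TABLE dbo.LogInvSearches_Daily
    DROP CONSTRAINT PK_LogInvSearches_Daily;

CREATE CLUSTERED INDEX CIX_LogInvSearches_Daily_Inv_ID_LogDay
ON dbo.LogInvSearches_Daily (Inv_ID, LogDay);

ALTER TABLE dbo.LogInvSearches_Daily
    ADD CONSTRAINT PK_LogInvSearches_Daily PRIMARY KEY NONCLUSTERED (ID);
```

If re-clustering is too disruptive, a nonclustered index on (Inv_ID, LogDay) that includes LogCount covers the "at least indexed" alternative.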