Question

我在SQL Server 2008 R2中有三个表：

PRODUCTS (id int, title varchar(100), ....)  
WORDS (id int,word varchar(100) )  
WORDS_IN_TITLES (product_id int, word_id int)

现在我想选择标题中使用了某些单词的所有产品。

现在我这样做：

declare  @words tp_intList  
insert into @words values(154)  
insert into @words values(172)  
declare @wordsCnt int = (select count(*) from @words)    

select * from products where id IN  
(
 select product_id from WORDS_IN_TITLES inner join 
 (select id from @words) wrds ON wrds.id=WORDS_IN_TITLES.word_id 
 group by product_id HAVING count(*)=@wordsCnt
)

它有效，但速度很慢。表包含600k行，返回3.5k行大约需要4秒。我需要它远远低于1秒。如何提高性能？

Answer 1

select products.*
    from products
        inner join (select p.id
                        from products p
                            inner join words_in_titles wit
                                on p.id = wit.product_id
                        where wit.word_id in (154,172)
                        group by p.id
                        having count(distinct wit.word_id) = 2) q
           on products.id = q.id

Answer 2

看起来您的查询不会有太大改进。

下面是一个示例表，其中包含600k行的产品和近600k行的words_in_titles。对于随机挑选的每2个word_id，应该有大约3到10个与该组合匹配的产品。

创建表格并填充数据。在words_in_titles（word_id）上创建索引

create table products (id int identity primary key clustered, title varchar(100))
insert into products
select convert(varchar(max),NEWID())
from master..spt_values a
inner join master..spt_values b on b.type='p' and b.number between 0 and 999
where a.type='P' and a.number between 0 and 600

create table words_in_titles (product_id int, word_id int,
    primary key clustered(product_id, word_id))
insert words_in_titles
select distinct a,b
from
(
select floor(convert(bigint,convert(varbinary(max),newid())) % 60000) a, floor(convert(bigint,convert(varbinary(max),newid())) % 1000) b
from master..spt_values a
inner join master..spt_values b on b.type='p' and b.number between 0 and 999
where a.type='P' and a.number between 0 and 600
) x

create index ix_words_in_titles on words_in_titles(word_id)

下次采用不同的方法。我们使用SET STATISTICS来查看内部统计信息。您还应检查执行计划（但不检查统计信息时 - 它会污染统计信息）。 DBCC命令用于刷新缓冲区和清除计划，将@clean位设置为1以在运行之间清除，0以模拟在白天运行期间数据可能已在缓冲区中。

declare @clean bit set @clean = 1
if(@clean=1) exec ('dbcc dropcleanbuffers dbcc freeproccache')

set statistics io off
set statistics time off

-- pick two random word_id's as generated (@word1 and @word2 used below)
declare @word1 int, @word2 int
select top 1 @word1 = word_id from words_in_titles order by NEWID()
select top 1 @word2 = word_id from words_in_titles where word_id <> @word1 order by NEWID()

declare  @words table (id int)  
insert into @words values(@word1)  
insert into @words values(@word2)  
declare @wordsCnt int = (select count(*) from @words)    

set statistics io on
set statistics time on

if(@clean=1) exec ('dbcc dropcleanbuffers dbcc freeproccache')

select *
from
(
select w.product_id
from words_in_titles w
where w.word_id = @word1
  and exists (select * from words_in_titles t where t.word_id=@word2 and t.product_id=w.product_id)
  -- expand with more EXISTS clauses
) q inner join products p on p.id = q.product_id

if(@clean=1) exec ('dbcc dropcleanbuffers dbcc freeproccache')

select *
from
(
select w1.product_id
from words_in_titles w1
where w1.word_id = @word1
intersect
select w2.product_id
from words_in_titles w2
where w2.word_id = @word2
) q inner join products p on p.id = q.product_id

if(@clean=1) exec ('dbcc dropcleanbuffers dbcc freeproccache')

select * from products where id IN  
(
 select product_id from WORDS_IN_TITLES inner join 
 (select id from @words) wrds ON wrds.id=WORDS_IN_TITLES.word_id 
 group by product_id HAVING count(*)=@wordsCnt
)

if(@clean=1) exec ('dbcc dropcleanbuffers dbcc freeproccache')

select products.*
    from products
        inner join (select p.id
                        from products p
                            inner join words_in_titles wit
                                on p.id = wit.product_id
                        where wit.word_id in (@word1,@word2)
                        group by p.id
                        having count(distinct wit.word_id) = 2) q
           on products.id = q.id

此处未显示执行计划，但如果您查看，则替代方案和原始查询都非常相似。
前两个查询从words_in_titles收集两次，但每个查询收集一个单独的word_id，然后使用不同的策略来计算交叉点
原始查询仅从words_in_titles和聚合中流式传输一次，但需要额外的I / O操作才能在tempdb中进行排序。
Joe的查询使用了更复杂的计划。

统计信息：查找以Table开头的批次以及之后的下一个SQL Server Execution Times。四个这样的片段将代表4个查询定时。

Table 'products'. Scan count 0, logical reads 30, physical reads 0, read-ahead reads 51, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'words_in_titles'. Scan count 2, logical reads 8, physical reads 2, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table '#4D5F7D71'. Scan count 1, logical reads 1, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

 SQL Server Execution Times:
   CPU time = 0 ms,  elapsed time = 47 ms.

（如上所示，原始查询使用临时排序）如果你运行足够多次，你会看到最后一个查询总是比其余查询慢，第二个通常比第一个快，第三个（原始）有时在第一个和第二个之前或之后。

结论

您可以尝试使用其中一种替代方案，但查询不太可能有太大改进。

选择其他表中都存在的行 - 如何提高性能？

2 个答案:

结论