Using LIKE to select all similar rows that may be duplicates?

Asked: 2011-04-08 02:14:29

Tags: sql sqlite sql-like

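After importing information about songs into my SQLite database, I want to use a SELECT statement to find all possible duplicate songs using this condition:

The songName in one row is similar or equal to the songName in any other row of the same table (Songs), and the artistID is the same in both rows. This should work without knowing the content of songName in advance. If I wanted to compare a known song name against all the other song names in the database, I could use "songName LIKE '%known name%'", but how do I find all duplicates without having that name?

Example Songs table: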

id  songName            artistID  duration
--------------------------------------------
0  This is a song       5         3:43
1  Another song         3         3:23
2  01-This is a song    5         3:42
3  song                 4         4:01
4  song                 4         6:33
5  Another record       2         2:45

Expected result:

id  songName            artistID  duration
--------------------------------------------
0   This is a song      5         3:43
2   01-This is a song   5         3:42
3   song                4         4:01
4   song                4         6:33

Edit

Since the idea of creating a hash and comparing hashes has been suggested, I am considering using this pseudo-function to create a hash for each song name:

Public Function createHash(ByVal phrase As String) As String
    'convert to lower case
    phrase = LCase(phrase)

    'split the phrase into words
    Dim words() As String = phrase.Replace("_", " ").Split(" ")

    Dim hash As String = ""
    For w = 0 To words.Count - 1
        'remove noise words (a, an, the, etc.)
        words(w) = removeNoiseWords(words(w))
        'convert 1 or 2-digit numbers to corresponding words
        words(w) = number2word(words(w))
    Next

    'rebuild using replaced words and remove spaces
    hash = String.Join("", words)

    'convert upper ascii into alphabetic (ie. ñ = n, Ö = O, etc.)
    hash = removeUnsupChars(hash, True)

    'strip away all remaining non-alphanumeric characters
    hash = REGEX_Replace(hash, "[^A-Za-z0-9]", "")
    Return hash
End Function

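Once the hash is computed, I will store it in each record and then select duplicates using count(hash) > 1. I will then use .NET code to check whether the artistID of the returned records is the same.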
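For readers who want to experiment with the normalization steps outside .NET, here is a rough Python sketch of the createHash pseudo-function above. The helpers removeNoiseWords, number2word and removeUnsupChars are not shown in the question, so NOISE_WORDS, number_to_word and the Unicode folding below are hypothetical stand-ins, not the author's implementations.

```python
import re
import unicodedata

# Hypothetical noise-word list; the question's removeNoiseWords is not shown.
NOISE_WORDS = {"a", "an", "the"}

# Hypothetical stand-in for number2word, covering 0 to 99.
ONES = ("zero one two three four five six seven eight nine ten eleven twelve "
        "thirteen fourteen fifteen sixteen seventeen eighteen nineteen").split()
TENS = ["", "", "twenty", "thirty", "forty", "fifty",
        "sixty", "seventy", "eighty", "ninety"]


def number_to_word(word):
    """Spell out a standalone 1- or 2-digit number; return anything else unchanged."""
    if not re.fullmatch(r"[0-9]{1,2}", word):
        return word
    n = int(word)
    if n < 20:
        return ONES[n]
    tens, ones = divmod(n, 10)
    return TENS[tens] + (ONES[ones] if ones else "")


def create_hash(phrase):
    # Lower-case, treat underscores as spaces, and split into words.
    words = phrase.lower().replace("_", " ").split()
    # Drop noise words and spell out short numbers.
    words = [number_to_word(w) for w in words if w not in NOISE_WORDS]
    joined = "".join(words)
    # Fold accented letters to their base letter (assumed role of removeUnsupChars).
    decomposed = unicodedata.normalize("NFKD", joined)
    folded = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Strip every remaining non-alphanumeric character.
    return re.sub(r"[^a-z0-9]", "", folded)


print(create_hash("This is a song"))     # thisissong
print(create_hash("01-This is a song"))  # 01thisissong
```

Note that under these assumptions "01-This is a song" is not split at the hyphen, so the "01" prefix survives and the two sample titles get different hashes. Matching them would need an extra step, such as stripping leading track numbers before hashing.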

So far this solution seems to work fine. This is the SQLite statement I use to find duplicate songs:

SELECT count(*),hash from Songs GROUP BY hash HAVING count(hash) > 1 ORDER BY hash;

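This gives me a list of all hashes that occur more than once. I store these results in an array and then loop through the array, using this statement to fetch the details: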

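    'open the connection once instead of once per hash
    SQLconnect.Open()
    For i = 0 To dupeHashes.Count - 1
        SQLcommand = SQLconnect.CreateCommand
        'bind the hash as a parameter rather than concatenating it into the SQL text
        '(AddWithValue assumes an ADO.NET provider such as System.Data.SQLite)
        SQLcommand.CommandText = "SELECT * FROM Songs WHERE hash = @hash;"
        SQLcommand.Parameters.AddWithValue("@hash", dupeHashes(i))
        SQLreader = SQLcommand.ExecuteReader()
        While SQLreader.Read()
            'get whatever data needed for each duplicate song
        End While
        SQLreader.Close()
        SQLcommand.Dispose()
    Next
    SQLconnect.Close()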
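The two-step lookup above (collect the duplicate hashes, then query each one) could probably be collapsed into a single statement that also performs the artistID check in SQL. The sketch below uses Python's built-in sqlite3 module with an in-memory copy of the sample table. The hash values are hand-written for illustration and assume a hash function that maps both "This is a song" rows to the same string.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Songs (id INTEGER PRIMARY KEY, songName TEXT, "
    "artistID INTEGER, duration TEXT, hash TEXT)"
)
# Sample rows from the question; the hash column is illustrative only.
conn.executemany(
    "INSERT INTO Songs VALUES (?, ?, ?, ?, ?)",
    [
        (0, "This is a song", 5, "3:43", "thisissong"),
        (1, "Another song", 3, "3:23", "anothersong"),
        (2, "01-This is a song", 5, "3:42", "thisissong"),
        (3, "song", 4, "4:01", "song"),
        (4, "song", 4, "6:33", "song"),
        (5, "Another record", 2, "2:45", "anotherrecord"),
    ],
)

# Group by hash AND artistID, then join back to fetch the full rows.
query = """
SELECT s.id, s.songName, s.artistID, s.duration
FROM Songs AS s
JOIN (
    SELECT hash, artistID
    FROM Songs
    GROUP BY hash, artistID
    HAVING count(*) > 1
) AS d ON d.hash = s.hash AND d.artistID = s.artistID
ORDER BY s.hash, s.id;
"""
duplicates = conn.execute(query).fetchall()
for row in duplicates:
    print(row)
```

Grouping on both columns means two songs by different artists that happen to share a hash are not reported, so no follow-up filtering in application code is needed.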

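2 answers:

Answer 0: (score: 2)

Personally, I would add an extra field that holds some kind of "hash" of the title. A good function for this would strip every non-alphabetic character, including spaces, remove any articles (such as "the", "a", "an"), then compute the soundex code of the title and prefix it with the artistId as a string.

So in your case you would get: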

id  songName            artistID  duration  Hash
----------------------------------------------------
0  This is a song       5         3:43      5.T0021
1  Another song         3         3:23      3.A9872
2  01-This is a song    5         3:42      5.T0021
3  song                 4         4:01      4.S0332
4  song                 4         6:33      4.S0332
5  Another record       2         2:45      2.A7622

From here, getting only the rows with ... count(Hash) > 1 should be easy...

Also note that I suggested Soundex, but you could write your own function, or adapt an existing one, so that some elements carry more weight than others.
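As a rough illustration of this suggestion, the Python sketch below builds a key from the artist ID and a standard American Soundex code of the cleaned title. The function names soundex and duplicate_key and the ARTICLES list are assumptions made for this example. The codes it produces will not match the Hash values in the table above, which look like illustrative placeholders rather than real Soundex output.

```python
import re

# Standard American Soundex digit assignments.
SOUNDEX_CODES = {}
for letters, digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                       ("L", "4"), ("MN", "5"), ("R", "6")):
    for ch in letters:
        SOUNDEX_CODES[ch] = digit

# Hypothetical article list to strip before encoding.
ARTICLES = {"a", "an", "the"}


def soundex(text):
    """Return the 4-character American Soundex code of text ('' if it has no letters)."""
    letters = re.sub(r"[^A-Z]", "", text.upper())
    if not letters:
        return ""
    digits = []
    prev = SOUNDEX_CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "HW":
            continue  # H and W do not separate letters that share a code
        code = SOUNDEX_CODES.get(ch, "")
        if code and code != prev:
            digits.append(code)
        prev = code  # vowels reset prev, so a repeated code after a vowel counts again
    return (letters[0] + "".join(digits) + "000")[:4]


def duplicate_key(artist_id, title):
    """Prefix the artist ID to the Soundex code of the title without articles."""
    words = [w for w in re.findall(r"[a-z]+", title.lower()) if w not in ARTICLES]
    return f"{artist_id}.{soundex(''.join(words))}"


print(duplicate_key(5, "This is a song"))     # 5.T225
print(duplicate_key(5, "01-This is a song"))  # 5.T225
print(duplicate_key(4, "song"))               # 4.S520
```

Because Soundex keeps only the first letter and three digits, long titles that start alike collapse to the same code: "Another song" and "Another record" both encode to A536 here and are told apart only by the artist prefix. That is one reason a custom or adapted function, as mentioned above, may be worth considering.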

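Answer 1: (score: 0)

I have a quick idea for this problem, but one point needs clarifying: why doesn't the result include the record "1  Another song  3  3:23"? Couldn't it be treated as a duplicate of the records "3  song  4  4:01" and "4  song  4  6:33"?

I just wrote a simple script in T-SQL to solve it. It is quite inefficient, so treat it only as a reference.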

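-- drop the temp tables only if a previous run left them behind
if object_id('tempdb..#t') is not null drop table #t;
if object_id('tempdb..#result') is not null drop table #result;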

create table #t 
(
id int ,
songName varchar(100),
artistID int,
duration varchar(20)
)
insert into #t
select '0',  'This is a song'   ,    '5'  ,       '3:43' union all
select '1',  'Another song'     ,    '3'  ,       '3:23' union all
select '2',  '01-This is a song',    '5'  ,       '3:42' union all
select '3',  'song'             ,    '4'  ,       '4:01' union all
select '4',  'song'             ,    '4'  ,       '6:33' union all
select '5',  'Another record'   ,    '2'  ,       '2:45'

select * from #t
select * into #result from #t where 1 = 0

declare @sName varchar(100)
declare @id int
declare @count int

declare c cursor for 
select id, songName from #t

open c
fetch next from c into @id, @sName
while @@FETCH_STATUS = 0
begin
    select @count = COUNT(*) from #result where id = @id
    if @count = 0 
    begin
        select @count = COUNT(*) from #t where songName like '%'+@sName+'%'
        --select @count , @sName
        if @count > 1
        begin
            insert into #result select *  from #t where songName like '%'+@sName+'%' and id not in (select id from #result)
        end
    end
fetch next from c into @id, @sName
end
close c
deallocate c

select * from #result 
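
For comparison, the same LIKE-based idea can be sketched for SQLite as a single self-join instead of a cursor. This is only one possible formulation, run here through Python's built-in sqlite3 module on the question's sample data. Unlike the script above, it also requires the artistID to match, as the question asks.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Songs (id INTEGER PRIMARY KEY, songName TEXT, "
    "artistID INTEGER, duration TEXT)"
)
conn.executemany(
    "INSERT INTO Songs VALUES (?, ?, ?, ?)",
    [
        (0, "This is a song", 5, "3:43"),
        (1, "Another song", 3, "3:23"),
        (2, "01-This is a song", 5, "3:42"),
        (3, "song", 4, "4:01"),
        (4, "song", 4, "6:33"),
        (5, "Another record", 2, "2:45"),
    ],
)

# Pair every row with every other row by the same artist and keep those
# where one name contains the other (|| is SQLite string concatenation).
query = """
SELECT DISTINCT a.id, a.songName, a.artistID, a.duration
FROM Songs AS a
JOIN Songs AS b
  ON a.id <> b.id
 AND a.artistID = b.artistID
 AND (a.songName LIKE '%' || b.songName || '%'
      OR b.songName LIKE '%' || a.songName || '%')
ORDER BY a.id;
"""
similar = conn.execute(query).fetchall()
for row in similar:
    print(row)
```

On the sample data this returns rows 0, 2, 3 and 4, matching the expected result in the question. Two caveats: a % or _ inside a song name is interpreted as a LIKE wildcard, and the join compares every pair of rows per artist, so it may be slow on a large table.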