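Duplicate database records: comparing values across numerous fields

Time: 2013-08-15 09:49:30

Tags: mysql sql deduplication

I'm trying to clean up some phone records in a database table.

I've already worked out how to find exact matches on 2 fields using the following: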

/* DUPLICATE first & last names */

SELECT 
    `First Name`, 
    `Last Name`, 
     COUNT(*) c 
FROM phone.contacts  
GROUP BY 
    `Last Name`, 
    `First Name` 
HAVING c > 1;
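Wow, great.

I'd like to expand this to look across multiple fields, to see whether a phone number in any 1 of 3 phone fields is duplicated.

So I want to look at 3 fields (general mobile, general phone, business phone) and:

1. Check that they are not empty ('').
2. Check whether the data (number) in any one of them appears in the other 2 phone fields anywhere in the table.

So, pushing my limited SQL past its limits, I came up with the following. It seems to return records where all 3 phone fields are empty, as well as records that have no duplicate phone numbers.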

/* DUPLICATE general & business phone nos */

SELECT 
    id, 
   `first name`, 
   `last name`, 
   `general mobile`, 
   `general phone`, 
   `general email`, 
   `business phone`, 
    COUNT(CASE WHEN `general mobile` <> '' THEN 1 ELSE NULL END) as gen_mob, 
    COUNT(CASE WHEN `general phone` <> '' THEN 1 ELSE NULL END) as gen_phone,
    COUNT(CASE WHEN `business phone` <> '' THEN 1 ELSE NULL END) as bus_phone 
FROM phone.contacts 
GROUP BY 
   `general mobile`, 
   `general phone`, 
   `business phone` 
HAVING gen_mob > 1 OR gen_phone > 1 OR bus_phone > 1;

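Clearly my logic is flawed, and I wonder if someone could point me in the right direction, take pity, etc.

Many thanks

3 Answers:

Answer 0: (score: 5)

The first thing you should do is shoot whoever named the columns with spaces in them.

Now, try this: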

SELECT DISTINCT
   c.id, 
   c.`first name`, 
   c.`last name`, 
   c.`general mobile`, 
   c.`general phone`, 
   c.`business phone`
from contacts_test c
join contacts_test c2
    on (c.`general mobile`!= '' and c.`general mobile` in (c2.`general phone`, c2.`business phone`))
    or (c.`general phone` != '' and c.`general phone` in (c2.`general mobile`, c2.`business phone`))
    or (c.`business phone`!= '' and c.`business phone` in (c2.`general mobile`, c2.`general phone`))

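See a live demo of this query on SQLFiddle.

Note the extra check for phone != ''. It is required because the phone columns are not nullable, so their "unknown" value is a blank string. Without this check, false matches are returned, because of course blank equals blank.

The DISTINCT keyword is there in case several other rows match the same row, which would otherwise produce an n x n result set.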
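
To see both points in action, here is a minimal sketch against a throwaway table. The table name contacts_demo and the sample rows are made up for illustration, and the columns are trimmed to just the ones the join needs.

```sql
-- Hypothetical scratch table; the name and the rows are illustrative only.
CREATE TABLE contacts_demo (
    id               INT PRIMARY KEY,
    `general mobile` VARCHAR(20) NOT NULL DEFAULT '',
    `general phone`  VARCHAR(20) NOT NULL DEFAULT '',
    `business phone` VARCHAR(20) NOT NULL DEFAULT ''
);

INSERT INTO contacts_demo VALUES
    (1, '555-0100', '',         ''),          -- same number as rows 2 and 3
    (2, '',         '555-0100', ''),
    (3, '',         '',         '555-0100'),
    (4, '',         '',         ''),          -- all blank: must not match anything
    (5, '555-0199', '',         '');          -- number used nowhere else

-- Same join shape as the query above, reduced to the id column.
SELECT DISTINCT c.id
FROM contacts_demo c
JOIN contacts_demo c2
    ON (c.`general mobile` != '' AND c.`general mobile` IN (c2.`general phone`,  c2.`business phone`))
    OR (c.`general phone`  != '' AND c.`general phone`  IN (c2.`general mobile`, c2.`business phone`))
    OR (c.`business phone` != '' AND c.`business phone` IN (c2.`general mobile`, c2.`general phone`))
ORDER BY c.id;
```

With these rows the result should be ids 1, 2 and 3. Each of them matches two other rows, so without DISTINCT each id would come back twice. Row 4 is kept out only by the != '' checks. Note that the ON clause compares each field with the other two fields only, so two rows sharing the same `general mobile` would not be paired by it.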

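Answer 1: (score: 1)

In my experience, when cleaning up data it is far better to have an understandable view of the data and simple ways of managing it than one huge, bulky query that does all the analysis at once.

You can also (more or less) renormalize the database, using something like: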

Create view VContactsWithPhones
as
Select id, 
       `Last Name` as LastName, 
       `First Name` as FirstName,
       `General Mobile` as Phone,
       'General Mobile' as PhoneType
From phone.contacts c
UNION
Select id, 
       `Last Name`, 
       `First Name`,
       `General Phone`,
       'General Phone'
From phone.contacts c
UNION
Select id, 
       `Last Name`, 
       `First Name`,
       `Business Phone`,
       'Business Phone'
From phone.contacts c

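This produces a view with three times as many rows as the original table, but with a single Phone column whose PhoneType can be one of three types.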
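
If the empty values are only noise for what you are doing, a slimmer variant might filter them out inside the view. This is a sketch, not part of the original view: the name VContactsWithPhonesNonEmpty is made up, and it assumes id is unique in phone.contacts. Since the PhoneType literal differs in every branch, no row can repeat across branches, so UNION ALL can skip the duplicate-elimination pass that UNION performs.

```sql
-- Hypothetical variant; the view name is made up for this sketch.
-- Assumes id is unique in phone.contacts, so no two rows can be identical.
CREATE VIEW VContactsWithPhonesNonEmpty AS
SELECT id,
       `Last Name`      AS LastName,
       `First Name`     AS FirstName,
       `General Mobile` AS Phone,
       'General Mobile' AS PhoneType
FROM phone.contacts
WHERE `General Mobile` <> ''      -- also drops NULL, since NULL <> '' is not true
UNION ALL
SELECT id, `Last Name`, `First Name`, `General Phone`, 'General Phone'
FROM phone.contacts
WHERE `General Phone` <> ''
UNION ALL
SELECT id, `Last Name`, `First Name`, `Business Phone`, 'Business Phone'
FROM phone.contacts
WHERE `Business Phone` <> '';
```

The trade-off is that a query looking for empty phones no longer works against this variant, so keep the unfiltered view around if you need that report too.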

You can then easily select from that view:

-- empty phones
SELECT * 
FROM VContactsWithPhones 
Where Phone is null or Phone = ''

-- duplicate phones
Select Phone, Count(*)
from VContactsWithPhones 
where (Phone is not null and Phone <> '')  -- exclude empty values
group by Phone
having count(*) > 1

-- duplicate phones belonging to the same ID (double entries)
Select Phone, ID, Count(*)
from VContactsWithPhones 
where (Phone is not null and Phone <> '')  -- exclude empty values
group by Phone, ID
having count(*) > 1

-- duplicate phones belonging to different IDs (duplicate entries)
Select v1.Phone, v1.ID, v1.PhoneType, v2.ID, v2.PhoneType
from VContactsWithPhones v1
   inner join VContactsWithPhones v2 
     on v1.Phone=v2.Phone and v1.ID<>v2.ID  -- different contacts sharing a number
where v1.Phone is not null and v1.Phone <> ''

And so on...
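
One more report in the same spirit, as a sketch: a summary with one row per shared number, listing which contacts and which fields hold it. It assumes the VContactsWithPhones view defined above. The column aliases (contacts, ids, fields) are made up.

```sql
-- Sketch: one row per number that is held by more than one contact.
SELECT Phone,
       COUNT(DISTINCT ID) AS contacts,
       GROUP_CONCAT(DISTINCT ID ORDER BY ID SEPARATOR ', ')               AS ids,
       GROUP_CONCAT(DISTINCT PhoneType ORDER BY PhoneType SEPARATOR ', ') AS fields
FROM VContactsWithPhones
WHERE Phone IS NOT NULL AND Phone <> ''   -- exclude empty values
GROUP BY Phone
HAVING COUNT(DISTINCT ID) > 1;
```

GROUP_CONCAT output is cut off at group_concat_max_len (1024 by default), which should be plenty for a handful of ids per number but is worth knowing about if one number is shared very widely.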

Answer 2: (score: 0)

You could try something like this:

SELECT *
FROM phone.contacts p
WHERE `general mobile` IN (
    SELECT `general mobile` FROM phone.contacts WHERE id != p.id
    UNION
    SELECT `general phone` FROM phone.contacts WHERE id != p.id
    UNION
    SELECT `business phone` FROM phone.contacts WHERE id != p.id
)

Repeat this 3 times, once for each of general mobile, general phone and business phone. It could all be put into a single query, but that would be less readable.
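
For comparison, here is a rough sketch of what the single-query form might look like. It assumes the three fields of interest are the phone columns from the question (general mobile, general phone, business phone) and that blank strings mean "no number". The alias o for the other row is made up.

```sql
-- Sketch of a combined version; not tested against the real table.
SELECT p.*
FROM phone.contacts p
WHERE EXISTS (
    SELECT 1
    FROM phone.contacts o
    WHERE o.id != p.id          -- only look at other contacts
      AND (
            (p.`general mobile` != '' AND p.`general mobile`
                IN (o.`general mobile`, o.`general phone`, o.`business phone`))
         OR (p.`general phone` != '' AND p.`general phone`
                IN (o.`general mobile`, o.`general phone`, o.`business phone`))
         OR (p.`business phone` != '' AND p.`business phone`
                IN (o.`general mobile`, o.`general phone`, o.`business phone`))
      )
);
```

Because EXISTS only asks whether at least one other row matches, each contact comes back once no matter how many matches it has, so no DISTINCT is needed.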