SQL Left Join first match only

Asked: 2013-11-06 23:35:33

Tags: sql sql-server tsql join greatest-n-per-group

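I'm running a query against a number of large tables (large in both rows and columns) with many joins, but one of the tables contains duplicate rows that cause problems for the query. It is a read-only, real-time feed from another department, so I can't fix the data; I'm trying to work around the problem in my query instead.

Given that, I need to add this junk data to my good query as a left join. The data set looks like this: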

IDNo    FirstName   LastName    ...
-------------------------------------------
uqx     bob         smith
abc     john        willis
ABC     john        willis
aBc     john        willis
WTF     jeff        bridges
sss     bill        doe
ere     sally       abby
wtf     jeff        bridges
...

(about two dozen columns and 100K rows)

My first instinct was to run a DISTINCT, which gave me about 80K rows:

SELECT DISTINCT P.IDNo
FROM people P

But when I try either of the following, I get all the rows back:

SELECT DISTINCT P.*
FROM people P

OR

SELECT 
    DISTINCT(P.IDNo) AS IDNoUnq 
    ,P.FirstName
    ,P.LastName
    ...etc.    
FROM people P

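Then I thought about applying a FIRST() aggregate function to every column, but that feels wrong too. Am I doing something wrong syntactically?

Update: Just to note, these records are duplicates based on the non-key, non-indexed ID field listed above. The ID is a text field, and although the value is the same, its letter case differs from row to row, which is what causes the issue.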

7 Answers:

Answer 0: (score: 34)

distinct is not a function. It always operates on all columns of the select list.
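A minimal sketch of what that means in practice. The table variable @people and the Note column are made up for illustration; Note stands in for one of the "trash" columns that differ between duplicates:

```sql
-- Hypothetical stand-in for the real table: same IDNo, one differing column.
DECLARE @people TABLE (IDNo varchar(3), FirstName varchar(5), Note varchar(10));

INSERT INTO @people (IDNo, FirstName, Note)
VALUES ('abc', 'john', 'first'),
       ('abc', 'john', 'second');

-- The parentheses only wrap an expression; DISTINCT still applies to the
-- whole select list, so both of these return 2 rows.
SELECT DISTINCT (IDNo), FirstName, Note FROM @people;
SELECT DISTINCT IDNo, FirstName, Note FROM @people;

-- Leaving out the differing column collapses the duplicates: 1 row.
SELECT DISTINCT IDNo, FirstName FROM @people;
```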

Your problem is a typical "greatest N per group" problem, which can easily be solved with a window function:

select ...
from (
  select IDNo,
         FirstName,
         LastName,
         ....,
         row_number() over (partition by lower(idno) order by firstname) as rn 
  from people 
) t
where rn = 1;

With the order by clause you control which of the duplicates gets picked.
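For instance, here is a sketch that keeps the most recently numbered duplicate instead. It assumes a sequence column named entry (as in the sample data posted in another answer); the table variable @people is only a stand-in for the real table:

```sql
-- Hypothetical stand-in for the real table, with an "entry" sequence column.
DECLARE @people TABLE (entry int, IDNo varchar(3), FirstName varchar(5), LastName varchar(7));

INSERT INTO @people (entry, IDNo, FirstName, LastName)
VALUES (2, 'abc', 'john', 'willis'),
       (3, 'ABC', 'john', 'willis'),
       (4, 'aBc', 'john', 'willis');

-- ORDER BY entry DESC gives rn = 1 to the highest entry in each group,
-- so the row with entry = 4 is the one that survives.
SELECT entry, IDNo, FirstName, LastName
FROM (
  SELECT entry, IDNo, FirstName, LastName,
         ROW_NUMBER() OVER (PARTITION BY LOWER(IDNo) ORDER BY entry DESC) AS rn
  FROM @people
) t
WHERE rn = 1;
```

Switching to ORDER BY entry ASC would keep the row with entry = 2 instead.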

The above can also be used in a left join:

select ...
from x
  left join (
    select IDNo,
           FirstName,
           LastName,
           ....,
           row_number() over (partition by lower(idno) order by firstname) as rn 
    from people 
  ) p on p.idno = x.idno and p.rn = 1
where ...
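One thing to watch: if the database uses a case-sensitive collation, the join key may need the same normalization as the partition key. A self-contained sketch, assuming a hypothetical driving table @x with an Amount column (neither is from the question):

```sql
-- Hypothetical stand-ins for the duplicated table and the driving table.
DECLARE @people TABLE (IDNo varchar(3), FirstName varchar(5), LastName varchar(7));
DECLARE @x TABLE (IDNo varchar(3), Amount int);

INSERT INTO @people (IDNo, FirstName, LastName)
VALUES ('abc', 'john', 'willis'),
       ('ABC', 'john', 'willis'),
       ('aBc', 'john', 'willis');

INSERT INTO @x (IDNo, Amount)
VALUES ('ABC', 10),
       ('zzz', 20);

-- One output row per row of @x: 'ABC' matches exactly one person,
-- 'zzz' matches none and gets NULLs from the left join.
SELECT x.IDNo, x.Amount, p.FirstName, p.LastName
FROM @x x
  LEFT JOIN (
    SELECT IDNo, FirstName, LastName,
           ROW_NUMBER() OVER (PARTITION BY LOWER(IDNo) ORDER BY FirstName) AS rn
    FROM @people
  ) p ON LOWER(p.IDNo) = LOWER(x.IDNo) AND p.rn = 1;
```

Wrapping the join columns in LOWER() will usually keep an index on them from being used for a seek, so on a case-insensitive collation the plain equality is the better choice.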

Answer 1: (score: 3)

Add an identity column (PeopleID), then use a correlated subquery to return the first row for each IDNo value.

SELECT *
FROM People p
WHERE PeopleID = (
    SELECT MIN(PeopleID) 
    FROM People 
    WHERE IDNo = p.IDNo
)
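Since the feed in the question is read-only, one way to get such a column is to copy the rows into a temp table and generate the identity on the way in. This is only a sketch: #PeopleCopy is a made-up name, and only three of the columns are copied.

```sql
-- Copy the read-only table into a temp table, generating PeopleID during the copy.
SELECT IDENTITY(int, 1, 1) AS PeopleID,
       IDNo,
       FirstName,
       LastName
INTO #PeopleCopy
FROM people;

-- Keep the lowest PeopleID per IDNo. UPPER() makes the match
-- case-insensitive even under a case-sensitive collation.
SELECT *
FROM #PeopleCopy p
WHERE PeopleID = (
    SELECT MIN(PeopleID)
    FROM #PeopleCopy
    WHERE UPPER(IDNo) = UPPER(p.IDNo)
);

DROP TABLE #PeopleCopy;
```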

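Answer 2: (score: 2)

It turns out I was doing it wrong. I needed to run a nested select of just the important columns first, then do a distinct select on top of that, so that the junk columns of "unique" data don't corrupt my good data. The following appears to have resolved the issue, but I'll try it against the full data set later.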

SELECT DISTINCT P2.*
FROM (
  SELECT
      IDNo
    , FirstName
    , LastName
  FROM people P
) P2

Here is some play data, as requested: http://sqlfiddle.com/#!3/050e0d/3

CREATE TABLE people
(
       [entry] int
     , [IDNo] varchar(3)
     , [FirstName] varchar(5)
     , [LastName] varchar(7)
);

INSERT INTO people
    (entry,[IDNo], [FirstName], [LastName])
VALUES
    (1,'uqx', 'bob', 'smith'),
    (2,'abc', 'john', 'willis'),
    (3,'ABC', 'john', 'willis'),
    (4,'aBc', 'john', 'willis'),
    (5,'WTF', 'jeff', 'bridges'),
    (6,'Sss', 'bill', 'doe'),
    (7,'sSs', 'bill', 'doe'),
    (8,'ssS', 'bill', 'doe'),
    (9,'ere', 'sally', 'abby'),
    (10,'wtf', 'jeff', 'bridges')
;

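Answer 3: (score: 2)

Given the nature of the duplicate rows, it looks like all you want is case-insensitivity on those columns. Setting a case-insensitive collation on them should give you what you're after: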

SELECT DISTINCT p.IDNO COLLATE SQL_Latin1_General_CP1_CI_AS, p.FirstName COLLATE SQL_Latin1_General_CP1_CI_AS, p.LastName COLLATE SQL_Latin1_General_CP1_CI_AS
FROM people P

http://msdn.microsoft.com/en-us/library/ms184391.aspx
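A quick way to see what the collation changes is to compare two literals under the case-insensitive (CI) and case-sensitive (CS) variants of the same collation. This is just an illustrative sketch; the column aliases are made up:

```sql
-- Same comparison, two collations: the CI variant treats the strings as
-- equal, the CS variant does not.
SELECT
  CASE WHEN 'abc' = 'ABC' COLLATE SQL_Latin1_General_CP1_CI_AS
       THEN 'equal' ELSE 'different' END AS ci_compare,
  CASE WHEN 'abc' = 'ABC' COLLATE SQL_Latin1_General_CP1_CS_AS
       THEN 'equal' ELSE 'different' END AS cs_compare;
```

Note that when DISTINCT collapses 'abc', 'ABC' and 'aBc' under a CI collation, which spelling comes back is not something to rely on.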

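Answer 4: (score: 1)

After careful consideration, this dilemma has a few different solutions:

Aggregate everything. Use an aggregate on each column to get the largest or smallest field value. This is what I am doing, since it takes two partially filled-in records and "merges" the data.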

http://sqlfiddle.com/#!3/59cde/1

SELECT
  UPPER(IDNo) AS user_id
, MAX(FirstName) AS name_first
, MAX(LastName) AS name_last
, MAX(entry) AS row_num
FROM people P
GROUP BY 
  UPPER(IDNo) -- group on the normalized key so case variants collapse under any collation
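The "merge" works because aggregates such as MAX() and MIN() skip NULLs. A small sketch with two partially filled rows (the table variable @people and its contents are made up for illustration):

```sql
-- Two partial records for the same person, differing only in letter case of IDNo.
DECLARE @people TABLE (entry int, IDNo varchar(3), FirstName varchar(5), LastName varchar(7));

INSERT INTO @people (entry, IDNo, FirstName, LastName)
VALUES (1, 'abc', 'john', NULL),
       (2, 'ABC', NULL,   'willis');

-- MAX() ignores the NULLs, so the single output row is
-- ABC / john / willis / 2.
SELECT
  UPPER(IDNo)    AS user_id
, MAX(FirstName) AS name_first
, MAX(LastName)  AS name_last
, MAX(entry)     AS row_num
FROM @people
GROUP BY
  UPPER(IDNo);
```

Keep in mind that each column is aggregated independently, so when both rows are fully populated but differ, the result can mix values from different source rows.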

Get the first (or last) record

http://sqlfiddle.com/#!3/59cde/23

-- ------------------------------------------------------
-- Notes
-- entry: Auto-number primary key; some sort of unique PK is required for this method
-- IDNo:  Should be the primary key in the feed, but is not; we build an upper-case version
-- This gets the first entry; to get the last entry, change MIN() to MAX()
-- ------------------------------------------------------

SELECT 
   PC.user_id
  ,PData.FirstName
  ,PData.LastName
  ,PData.entry
FROM (
  SELECT 
      P2.user_id
     ,MIN(P2.entry) AS rownum
  FROM (
    SELECT
        UPPER(P.IDNo) AS user_id 
      , P.entry 
    FROM people P
  ) AS P2
  GROUP BY 
    P2.user_id
) AS PC
LEFT JOIN people PData
ON PData.entry = PC.rownum
ORDER BY 
   PData.entry

Answer 5: (score: 0)

Try this:

 SELECT *
 FROM people P 
 where P.IDNo in (SELECT DISTINCT IDNo
              FROM people)

Answer 6: (score: 0)

Use Cross Apply or Outer Apply. That way you can limit the data joined from the table with the duplicates to the first hit.

Select 
    x.*,
    c.*
from 
    x
Cross Apply 
    (
        Select 
            Top (1)
            IDNo,
            FirstName,
            LastName,
            ....
        from 
            people As p
        where 
            p.idno = x.idno
        Order By 
            p.idno -- unnecessary if you don't need a specific match based on order
    ) As c

Cross Apply behaves like an inner join; Outer Apply behaves like a left join.
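A self-contained sketch of the Outer Apply variant, which is the one that matches the left join asked for in the question. The driving table @x, its Amount column, and the entry ordering column are assumptions made for illustration:

```sql
-- Hypothetical stand-ins for the duplicated table and the driving table.
DECLARE @people TABLE (entry int, IDNo varchar(3), FirstName varchar(5), LastName varchar(7));
DECLARE @x TABLE (IDNo varchar(3), Amount int);

INSERT INTO @people (entry, IDNo, FirstName, LastName)
VALUES (5, 'WTF', 'jeff', 'bridges'),
       (10, 'wtf', 'jeff', 'bridges');

INSERT INTO @x (IDNo, Amount)
VALUES ('wtf', 10),
       ('zzz', 20);

-- OUTER APPLY keeps every row of @x: 'wtf' gets the first match by entry
-- (entry = 5), 'zzz' has no match and gets NULLs.
-- With CROSS APPLY the 'zzz' row would be dropped.
SELECT x.IDNo, x.Amount, c.entry, c.FirstName, c.LastName
FROM @x x
OUTER APPLY (
    SELECT TOP (1) p.entry, p.FirstName, p.LastName
    FROM @people AS p
    WHERE LOWER(p.IDNo) = LOWER(x.IDNo)
    ORDER BY p.entry
) AS c;
```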

SQL Server CROSS APPLY and OUTER APPLY