Need help identifying duplicates in a table

Time: 2015-04-17 20:18:38

Tags: sql postgresql duplicates amazon-redshift

What I have:

  1. The table data_source_1
  2. The table data_source_2
  3. The view data_sources_view

About the tables:

    data_source_1

    No duplicates:

    db=# select count(*) from (select distinct * from data_source_1);
    count 
    --------
    543243
    (1 row)
    
    db=# select count(*) from (select * from data_source_1);
    count 
    --------
    543243
    (1 row)
    

    data_source_2

    No duplicates:

    db=# select count(*) from (select * from data_source_2);
    count 
    -------
    5304
    (1 row)
    
    db=# select count(*) from (select distinct * from data_source_2);
    count 
    -------
    5304
    (1 row)
    

    data_sources_view

    Has duplicates:

    db=# select count(*) from (select distinct * from data_sources_view);
    count 
    --------
    538714
    (1 row)
    
    db=# select count(*) from (select * from data_sources_view);
    count 
    --------
    548547
    (1 row)
    

    The view is simple:

    CREATE VIEW data_sources_view
    AS SELECT * 
    FROM (
          (
           SELECT a, b, 'data_source_1' as source
           FROM data_source_1
          )
          UNION ALL 
          ( 
           SELECT a, b, 'data_source_2' as source
           FROM data_source_2
          )
    );
    

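    What I want to know:

    • How can the view contain duplicates when the source tables have none? The added 'data_source_x' as source column should rule out any overlap between the two tables.
    • How can I identify the duplicates?

    What I tried: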

    db=# create table t1 as select * from data_sources_view;
    SELECT
    db=# 
    db=# create table t2 as select distinct * from data_sources_view;
    SELECT
    db=# create table t3 as select * from t1 minus select * from t2;
    SELECT
    db=# select 't1' as table_name, count(*) from t1 UNION ALL
    db-# select 't2' as table_name, count(*) from t2 UNION ALL
    db-# select 't3' as table_name, count(*) from t3;
    table_name | count 
    ------------+--------
    t1 | 548547
    t3 | 0
    t2 | 538714
    (3 rows)
    

    Database:

    Redshift (PostgreSQL)

2 answers:

Answer 0 (score: 2)

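The reason is that your data sources have more than two columns, while the view selects only a and b. Rows that differ only in the other columns become identical in the view. If you run: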

select count(*) from (select distinct a, b from data_source_1);

select count(*) from (select distinct a, b from data_source_2);

you should find that these counts differ from the count(*) on the same tables.
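To see which rows are repeated, and not only how many, one possible approach is to group the view by its columns and keep the groups that occur more than once. This is a sketch that assumes the view exposes exactly the columns a, b, and source, as in the CREATE VIEW statement in the question; the alias copies is made up for illustration:

```sql
-- List every (a, b, source) combination that appears more than once in the view.
-- Assumes the view has only the columns a, b, and source.
SELECT a, b, source, count(*) AS copies
FROM data_sources_view
GROUP BY a, b, source
HAVING count(*) > 1
ORDER BY copies DESC;
```

This would also explain why the t1 MINUS t2 attempt came back empty: MINUS is a set operator that returns distinct rows, and every distinct row of t1 is also present in t2.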

Answer 1 (score: 0)

UNION vs UNION ALL

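  1. UNION - if a row is already returned by the top query, it is suppressed in the bottom query (duplicates are removed from the combined result).

     Output:

         FOO

  2. UNION ALL - the row is repeated because it exists in both tables (both records are shown).

     Output:

         FOO
         FOO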
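The difference can be reproduced without any tables. The following is a minimal sketch using literal values; the column alias val is hypothetical:

```sql
-- UNION removes duplicate rows from the combined result: one row is returned.
SELECT 'FOO' AS val
UNION
SELECT 'FOO' AS val;

-- UNION ALL keeps every row from both queries: two rows are returned.
SELECT 'FOO' AS val
UNION ALL
SELECT 'FOO' AS val;
```

Note that in the question's view the literal source column differs between the two branches, so a row from data_source_1 can never equal a row from data_source_2. Replacing UNION ALL with UNION would still collapse the repeats, because UNION deduplicates the whole combined result, including rows repeated within a single branch.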