What should I do when DISTINCT isn't an option on a join?

Time: 2017-01-09 18:05:28

Tags: sql postgresql

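I'm working with a very large data set and have run into a problem that I'm not sure the current approach can solve. I'll preface this by saying that I didn't come up with the original design, but we were tasked with taking it over. Rewriting the logic at this point would be a very significant undertaking.

The project runs reports on a data warehouse, but to make things friendlier I created an example that illustrates the problem I'm having.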

CREATE TEMPORARY TABLE test_customers2 (
    id              integer PRIMARY KEY,
    first_name      varchar(40) NOT NULL,
    last_name       varchar(40) NOT NULL,
    newsletter      integer NOT NULL,
    vipmember       integer NOT NULL
);

INSERT INTO test_customers2 VALUES(1, 'Reed', 'Richards', 1, 1);
INSERT INTO test_customers2 VALUES(2, 'Johnny', 'Storm', 0, 1);
INSERT INTO test_customers2 VALUES(3, 'Peter', 'Parker', 1, 0);

CREATE TEMPORARY TABLE test_purchases (
    id        integer CONSTRAINT firstkey2 PRIMARY KEY,
    cid       integer NOT NULL
);

INSERT INTO test_purchases VALUES(1, 1);
INSERT INTO test_purchases VALUES(2, 2);
INSERT INTO test_purchases VALUES(3, 2);
INSERT INTO test_purchases VALUES(4, 3);

SELECT 
    COUNT(distinct c.id) as "Total Customers"
    ,COUNT(distinct p.id) as "Total Sales"
    ,COUNT(distinct p.id)::decimal/COUNT(distinct c.id)::decimal as "Sales per customer"
    ,SUM(c.newsletter) as "Subscribed"
    ,SUM(c.newsletter)::decimal/COUNT(c.newsletter)::decimal as "Pct Subscribed"
    ,SUM(c.vipmember) as "VIP"
    ,SUM(c.vipmember)::decimal/COUNT(c.vipmember)::decimal as "Pct VIP"
FROM test_customers2 c
    INNER JOIN test_purchases p ON c.id = p.cid

When you run the final SELECT, you get this result:

3 | 4 | 1.33... | 2 | 0.50... | 3 | 0.75...

The problem is that the join is throwing off my results, because what I'm really looking for is:

3 | 4 | 1.33... | 2 | 0.66... | 2 | 0.66...

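DISTINCT helps with unique values, but it doesn't work for the boolean flags (literally int in this example, not declared as boolean), because their only possible values are 1, 0, or null. I think I might need a subquery for this, but besides the performance hit, rewriting a large amount of code would be a bit of a pain. Is there a better way that I might be missing?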
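To make the skew concrete, here is a minimal sketch that lists the rows the INNER JOIN actually feeds into the aggregates. It assumes SQLite through Python's built-in sqlite3 module as a stand-in for PostgreSQL so that it is self-contained; the variable names are illustrative only.

```python
import sqlite3

# Assumption: in-memory SQLite stands in for the PostgreSQL temp tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE test_customers2 (
        id          integer PRIMARY KEY,
        first_name  varchar(40) NOT NULL,
        last_name   varchar(40) NOT NULL,
        newsletter  integer NOT NULL,
        vipmember   integer NOT NULL
    );
    INSERT INTO test_customers2 VALUES (1, 'Reed', 'Richards', 1, 1);
    INSERT INTO test_customers2 VALUES (2, 'Johnny', 'Storm', 0, 1);
    INSERT INTO test_customers2 VALUES (3, 'Peter', 'Parker', 1, 0);

    CREATE TABLE test_purchases (
        id   integer PRIMARY KEY,
        cid  integer NOT NULL
    );
    INSERT INTO test_purchases VALUES (1, 1);
    INSERT INTO test_purchases VALUES (2, 2);
    INSERT INTO test_purchases VALUES (3, 2);
    INSERT INTO test_purchases VALUES (4, 3);
""")

# The join yields one row per purchase, so a customer with two
# purchases (id 2) contributes their flags twice.
joined_rows = conn.execute("""
    SELECT c.id, c.first_name, c.newsletter, c.vipmember, p.id
    FROM test_customers2 c
        INNER JOIN test_purchases p ON c.id = p.cid
    ORDER BY p.id
""").fetchall()

for row in joined_rows:
    print(row)

# Summing the flags over the joined rows reproduces the skewed figures.
newsletter_sum = sum(row[2] for row in joined_rows)
vip_sum = sum(row[3] for row in joined_rows)
print(newsletter_sum, newsletter_sum / len(joined_rows))
print(vip_sum, vip_sum / len(joined_rows))
```

Customer 2 has two purchases and therefore appears twice, which is why the VIP sum comes out as 3 over 4 rows instead of 2 over 3 customers.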

3 answers:

Answer 0: (score: 2)

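The problem is that you are performing the join only to pull columns from separate tables into one row set. You are not actually using the relationship between the two source tables, and that relationship has nothing to do with what you want to compute. What you want to relate are aspects of the aggregate data, so it is the aggregates, not the base rows, that you should be joining.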

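I suggest computing the single-table statistics in separate inline views / CTEs, then (cross) joining the two single-row results to get another single row from which to perform the final select. Something like this, for example: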

SELECT 
    c.c_count as "Total Customers"
    ,p.p_count as "Total Sales"
    ,p.p_count::decimal/c.c_count::decimal as "Sales per customer"
    ,c.nl_sum as "Subscribed"
    ,c.nl_sum::decimal/c.c_count::decimal as "Pct Subscribed"
    ,c.vip_sum as "VIP"
    ,c.vip_sum::decimal/c.c_count::decimal as "Pct VIP"
FROM
  (
    SELECT
      count(*) as c_count,
      sum(newsletter) as nl_sum,
      sum(vipmember) as vip_sum
    FROM test_customers2
  ) c
  CROSS JOIN
  (
    SELECT COUNT(*) AS p_count FROM test_purchases
  ) p
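
The same idea can be written with CTEs instead of inline views. The following is only a sketch under stated assumptions: it runs against SQLite through Python's sqlite3 module rather than PostgreSQL, so the casts are written as CAST(... AS REAL) instead of ::decimal, and the name columns are trimmed from the sample table for brevity.

```python
import sqlite3

# Assumption: in-memory SQLite stands in for PostgreSQL; name columns omitted.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE test_customers2 (
        id integer PRIMARY KEY,
        newsletter integer NOT NULL,
        vipmember integer NOT NULL
    );
    INSERT INTO test_customers2 VALUES (1, 1, 1), (2, 0, 1), (3, 1, 0);

    CREATE TABLE test_purchases (
        id integer PRIMARY KEY,
        cid integer NOT NULL
    );
    INSERT INTO test_purchases VALUES (1, 1), (2, 2), (3, 2), (4, 3);
""")

# Each CTE aggregates one table down to a single row, so the CROSS JOIN
# produces exactly one row and nothing is counted twice.
row = conn.execute("""
    WITH c AS (
        SELECT count(*) AS c_count,
               sum(newsletter) AS nl_sum,
               sum(vipmember) AS vip_sum
        FROM test_customers2
    ),
    p AS (
        SELECT count(*) AS p_count FROM test_purchases
    )
    SELECT c.c_count,
           p.p_count,
           CAST(p.p_count AS REAL) / c.c_count,
           c.nl_sum,
           CAST(c.nl_sum AS REAL) / c.c_count,
           c.vip_sum,
           CAST(c.vip_sum AS REAL) / c.c_count
    FROM c CROSS JOIN p
""").fetchone()

print(row)
```

With the sample data this gives 3 customers, 4 sales, 2 subscribed and 2 VIP, with both percentages at two thirds, which matches the result the question asks for.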

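Answer 1: (score: 0)

You don't actually need the join. None of your logic requires matching the 2 tables. Here is the query in MSSQL (sorry, I don't know Postgres), but I think you can translate it.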

SELECT COUNT(*) as "Total Customers",
    (SELECT COUNT(*) FROM test_purchases) as "Total Sales",
    CAST((SELECT COUNT(*) FROM test_purchases) AS DECIMAL) / COUNT(*) as "Sales per Customer",
    SUM(c.newsletter) as "Subscribed",
    CAST(SUM(c.newsletter) AS DECIMAL) / COUNT(*) as "Pct Subscribed",
    SUM(c.vipmember) as "VIP",
    CAST(SUM(c.vipmember) AS DECIMAL) / COUNT(*) as "Pct VIP"
FROM test_customers2 c

Answer 2: (score: 0)

Possibly more "flexible":

SELECT 
    COUNT(c.id) as "Total Customers"
    ,SUM(p.total_sales) as "Total Sales"
    ,SUM(p.total_sales)::decimal/COUNT(c.id)::decimal as "Sales per customer"
    ,SUM(c.newsletter) as "Subscribed"
    ,SUM(c.newsletter)::decimal/COUNT(c.newsletter)::decimal as "Pct Subscribed"
    ,SUM(c.vipmember) as "VIP"
    ,SUM(c.vipmember)::decimal/COUNT(c.vipmember)::decimal as "Pct VIP"
FROM test_customers2 c
    JOIN (select cid, count(*) as total_sales from test_purchases group by cid) p ON c.id = p.cid
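
One caveat with pre-aggregating and then using an inner join: a customer with no purchases has no row in the subquery and so drops out of every count. The sketch below illustrates this with a hypothetical fourth customer who has bought nothing (not part of the question's data) and compares INNER JOIN with LEFT JOIN. It assumes SQLite through Python's sqlite3 module as a stand-in for PostgreSQL, with the name columns trimmed.

```python
import sqlite3

# Assumption: in-memory SQLite stands in for PostgreSQL.
# Customer 4 is hypothetical and has no purchases.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE test_customers2 (
        id integer PRIMARY KEY,
        newsletter integer NOT NULL,
        vipmember integer NOT NULL
    );
    INSERT INTO test_customers2 VALUES (1, 1, 1), (2, 0, 1), (3, 1, 0), (4, 0, 0);

    CREATE TABLE test_purchases (
        id integer PRIMARY KEY,
        cid integer NOT NULL
    );
    INSERT INTO test_purchases VALUES (1, 1), (2, 2), (3, 2), (4, 3);
""")

# The subquery has at most one row per customer, so the join cannot
# duplicate customer rows; only the join type differs between the runs.
QUERY = """
    SELECT COUNT(c.id),
           SUM(COALESCE(p.total_sales, 0)),
           SUM(c.newsletter),
           SUM(c.vipmember)
    FROM test_customers2 c
        {join_type} JOIN (
            SELECT cid, count(*) AS total_sales
            FROM test_purchases
            GROUP BY cid
        ) p ON c.id = p.cid
"""

inner_row = conn.execute(QUERY.format(join_type="INNER")).fetchone()
left_row = conn.execute(QUERY.format(join_type="LEFT")).fetchone()

print("INNER:", inner_row)
print("LEFT: ", left_row)
```

With the inner join the customer count stays at 3 even though the table holds 4 customers; the left join counts all 4 while the sales total is unchanged. Which behavior is wanted depends on whether the report should cover all customers or only those with purchases.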