Limit the number of occurrences of elements in 2-tuples

Asked: 2016-05-05 13:15:04

Tags: sql postgresql postgresql-9.4

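I am trying to find a set-based query solution in SQL (or PostgreSQL 9.4) for the following problem:

I have a finite set of unique 2-tuples (x ∈ N, y ∈ N), each of which has been assigned a rank.

Now I want to remove tuples so that the remaining tuples satisfy the following conditions:

  1. Each number occurs at most n times on the left side of a tuple.
  2. Each number occurs at most m times on the right side of a tuple.

This is easy to do with a procedure that iterates over the tuples in rank order and counts the occurrences of each element. However, I would like to know whether there is a solution that uses a single (Postgre)SQL query.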

More specifically, consider the following simple example with n = 2 and m = 2:

    ╔═══╦═══╦══════╗
    ║ x ║ y ║ rank ║
    ╠═══╬═══╬══════╣
    ║ 1 ║ 4 ║    1 ║
    ║ 2 ║ 4 ║    2 ║
    ║ 3 ║ 4 ║    3 ║
    ║ 3 ║ 5 ║    4 ║
    ║ 3 ║ 6 ║    5 ║
    ║ 3 ║ 7 ║    6 ║
    ╚═══╩═══╩══════╝
    

We are now looking for a query that returns the following tuples: (1,4), (2,4), (3,5), (3,6)
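The iterative procedure mentioned above can be sketched outside the database as a reference for what a query should return. This is a minimal Python sketch, not part of the original question; the function name `greedy_filter` is hypothetical, and it assumes that only tuples that are actually kept count towards the limits n and m.

```python
from collections import Counter

def greedy_filter(rows, n, m):
    """Walk the tuples in rank order and keep a tuple only while
    x has fewer than n and y has fewer than m kept occurrences."""
    x_kept, y_kept = Counter(), Counter()
    kept = []
    for x, y, _rank in sorted(rows, key=lambda row: row[2]):
        # Assumption: rejected tuples do not count towards either limit.
        if x_kept[x] < n and y_kept[y] < m:
            x_kept[x] += 1
            y_kept[y] += 1
            kept.append((x, y))
    return kept

# (x, y, rank) rows from the example table
sample = [(1, 4, 1), (2, 4, 2), (3, 4, 3), (3, 5, 4), (3, 6, 5), (3, 7, 6)]
print(greedy_filter(sample, n=2, m=2))  # [(1, 4), (2, 4), (3, 5), (3, 6)]
```

On the example data this reproduces the four tuples listed above, so it can serve as a baseline when checking a candidate query on other inputs.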

SQL Fiddle with the table and values:

    create table tab (
      x bigint,
      y bigint,
      rank bigint);

    insert into tab values (1,4,1);
    insert into tab values (2,4,2);
    insert into tab values (3,4,3);
    insert into tab values (3,5,4);
    insert into tab values (3,6,5);
    insert into tab values (3,7,6);
    

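I tried an approach using Postgres window functions. It solves the example above, but I am not sure whether it finds as many pairs as the cursor-based approach does on other examples.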

    SELECT x, y FROM (
      SELECT x, y, ROW_NUMBER() OVER (PARTITION BY x ORDER BY rank) AS rx FROM (
        SELECT x, y, rank, ROW_NUMBER() OVER (PARTITION BY y ORDER BY rank) AS ry FROM tab) AS limitY
      WHERE limitY.ry < 3) AS limitX
    WHERE limitX.rx < 3
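
The doubt about other inputs can be checked on small hand-made cases. The Python sketch below is not part of the original question. It emulates the two nested `ROW_NUMBER()` filters (per y first, then per x) and compares them with a rank-ordered iteration that counts only kept tuples. The helper names and the five-row input are made up for illustration, and ranks are assumed to be unique.

```python
from collections import Counter

def greedy_filter(rows, n, m):
    """Iterative baseline: only kept tuples count towards the limits."""
    x_kept, y_kept, kept = Counter(), Counter(), []
    for x, y, _rank in sorted(rows, key=lambda row: row[2]):
        if x_kept[x] < n and y_kept[y] < m:
            x_kept[x] += 1
            y_kept[y] += 1
            kept.append((x, y))
    return kept

def nested_window_filter(rows, n, m):
    """Emulates the nested query: ROW_NUMBER() per y <= m, then per x <= n."""
    ordered = sorted(rows, key=lambda row: row[2])
    y_pos, after_y = Counter(), []
    for x, y, rank in ordered:          # inner query (limitY)
        y_pos[y] += 1
        if y_pos[y] <= m:
            after_y.append((x, y, rank))
    x_pos, kept = Counter(), []
    for x, y, _rank in after_y:         # outer query (limitX)
        x_pos[x] += 1
        if x_pos[x] <= n:
            kept.append((x, y))
    return kept

# Hypothetical input: (1,12) passes the y filter but is later dropped by the x filter.
rows = [(1, 10, 1), (1, 11, 2), (1, 12, 3), (2, 12, 4), (3, 12, 5)]
print(nested_window_filter(rows, 2, 2))  # [(1, 10), (1, 11), (2, 12)]
print(greedy_filter(rows, 2, 2))         # [(1, 10), (1, 11), (2, 12), (3, 12)]
```

In this input, (1,12) still occupies a slot for y = 12 in the inner pass even though the outer pass removes it, so (3,12) is lost. Under these assumptions the nested filter can therefore return fewer pairs than the iterative procedure.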
    

2 answers:

Answer 0 (score: 0)

Here is a variant that uses a single window-function pass (possibly faster):

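    -- cx and cy are running counts over all rows in rank order,
    -- including rows that the outer filter later discards
    select x, y, rank
    from (
      select *, count(*) over (partition by x order by rank) as cx,
                count(*) over (partition by y order by rank) as cy
      from tab
      ) t
    where cx < 3 and cy < 3
    order by rank;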
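
To see what the single-pass query above computes on the example data, the running counts can be traced by hand. The Python sketch below is an illustration only; `running_counts` is a hypothetical helper that mimics `count(*) over (partition by ... order by rank)` under the assumption that ranks are unique.

```python
from collections import Counter

def running_counts(rows):
    """Return (x, y, cx, cy) per row, where cx and cy count every row
    seen so far with the same x or y, kept or not."""
    cx, cy, out = Counter(), Counter(), []
    for x, y, _rank in sorted(rows, key=lambda row: row[2]):
        cx[x] += 1
        cy[y] += 1
        out.append((x, y, cx[x], cy[y]))
    return out

sample = [(1, 4, 1), (2, 4, 2), (3, 4, 3), (3, 5, 4), (3, 6, 5), (3, 7, 6)]
counted = running_counts(sample)
for row in counted:
    print(row)

# Same filter as the query: cx < 3 and cy < 3
kept = [(x, y) for x, y, x_count, y_count in counted if x_count < 3 and y_count < 3]
print(kept)  # [(1, 4), (2, 4), (3, 5)]
```

In this trace (3,4) is rejected because of y = 4 but still counts towards x = 3, so (3,6) gets a count of 3 and is filtered out. On the example data the single pass therefore yields three tuples rather than the four the question asks for.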

There is also a recursive CTE approach:

    -- tabr numbers the rows by rank. If rank is trusted to be sequential and
    -- uninterrupted, starting with 1, use tab directly instead of the tabr CTE
    -- (and replace all occurrences of the r column with rank).
    -- Note: cx and cy only count consecutive rows that share x (or y)
    -- with the directly preceding row.
    with recursive
      r (r, x, y, rank, cx, cy) as (
        select *, 1 as cx, 1 as cy
        from tabr where r = 1
        union all
        select t.*,
               case when r.x = t.x then r.cx + 1 else 1 end as cx,
               case when r.y = t.y then r.cy + 1 else 1 end as cy
        from r, tabr t
        where t.r = r.r + 1
        ),
      tabr as (
        -- number the rows explicitly by rank; over () alone does not guarantee an order
        select row_number() over (order by rank) as r, *
        from tab
        )
    select x, y, rank
    from r
    where cx <= 2 and cy <= 2
    order by r;

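Answer 1 (score: 0)

This one took a while, but I was able to come up with a solution in MS SQL Server that I think should translate to PostgreSQL. SQL Server has some restrictions on what is allowed in a recursive CTE, and I don't know exactly which constraints PostgreSQL has. That said, hopefully this works for you or points you in the right direction.

The tricky part is that which rows get excluded depends on the rows that were already excluded, so you cannot simply count them. Because the exclusion depends on both x and y, and the recursive CTE can only be referenced once, the counts could not be built up in order through separate references. That is when I came up with the idea of embedding the counts in strings. This does not scale well at all: for example, if the rule changes to 3 or 4 instances before a row is excluded, the CASE statements start to explode.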

    WITH CTE_Excludes AS
    (
        -- Anchor: the rank 1 row is always kept; counts are stored as '|<value>-<count>|' tokens
        SELECT
            x,
            y,
            [rank],
            CAST('|' + CAST(x AS VARCHAR(4)) + '-1|' AS VARCHAR(1000)) AS x_counts,
            CAST('|' + CAST(y AS VARCHAR(4)) + '-1|' AS VARCHAR(1000)) AS y_counts,
            0 AS excluded
        FROM
            tab
        WHERE
            [rank] = 1
        UNION ALL
        -- Recursive step: visit the row with the next rank (relies on consecutive rank values)
        SELECT
            T.x,
            T.y,
            T.[rank],
            -- x counts: unchanged if the row is excluded, otherwise bump 1 to 2 or append a new token
            CAST(CASE
                WHEN X.x_counts LIKE '%|' + CAST(T.x AS VARCHAR(4)) + '-2|%' OR X.y_counts LIKE '%|' + CAST(T.y AS VARCHAR(4)) + '-2|%' THEN X.x_counts
                WHEN X.x_counts LIKE '%|' + CAST(T.x AS VARCHAR(4)) + '-1|%' THEN REPLACE(X.x_counts, '|' + CAST(T.x AS VARCHAR(4)) + '-1|', '|' + CAST(T.x AS VARCHAR(4)) + '-2|')
                ELSE X.x_counts + '|' + CAST(T.x AS VARCHAR(4)) + '-1|'
            END AS VARCHAR(1000)) AS x_counts,
            -- y counts: same logic as for x
            CAST(CASE
                WHEN X.x_counts LIKE '%|' + CAST(T.x AS VARCHAR(4)) + '-2|%' OR X.y_counts LIKE '%|' + CAST(T.y AS VARCHAR(4)) + '-2|%' THEN X.y_counts
                WHEN X.y_counts LIKE '%|' + CAST(T.y AS VARCHAR(4)) + '-1|%' THEN REPLACE(X.y_counts, '|' + CAST(T.y AS VARCHAR(4)) + '-1|', '|' + CAST(T.y AS VARCHAR(4)) + '-2|')
                ELSE X.y_counts + '|' + CAST(T.y AS VARCHAR(4)) + '-1|'
            END AS VARCHAR(1000)) AS y_counts,
            -- the row is excluded when x or y already has two kept occurrences
            CASE
                WHEN X.x_counts LIKE '%|' + CAST(T.x AS VARCHAR(4)) + '-2|%' OR X.y_counts LIKE '%|' + CAST(T.y AS VARCHAR(4)) + '-2|%' THEN 1
                ELSE 0
            END AS excluded
        FROM
            CTE_Excludes X
        INNER JOIN tab T ON T.[rank] = X.[rank] + 1
    )
    SELECT
        x, y
    FROM
        CTE_Excludes
    WHERE
        excluded = 0
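
The string-encoded counters in the query above can be mimicked outside SQL to follow how the state changes from row to row. The Python sketch below is an illustration only. The names `bump`, `step` and `filter_rows` are hypothetical, the limit of 2 is hard-coded as in the query, and rows are visited in sorted rank order instead of joining on `rank + 1`.

```python
def bump(counts, value):
    """Mirror the CASE branches: turn '|v-1|' into '|v-2|', or append '|v-1|'."""
    one, two = f"|{value}-1|", f"|{value}-2|"
    if one in counts:
        return counts.replace(one, two)
    return counts + one

def step(x_counts, y_counts, x, y):
    """One recursion step: returns the new (x_counts, y_counts, excluded)."""
    if f"|{x}-2|" in x_counts or f"|{y}-2|" in y_counts:
        return x_counts, y_counts, 1   # row excluded, state unchanged
    return bump(x_counts, x), bump(y_counts, y), 0

def filter_rows(rows):
    ordered = sorted(rows, key=lambda row: row[2])
    first_x, first_y, _rank = ordered[0]
    # Anchor member: the first row is always kept.
    x_counts, y_counts = f"|{first_x}-1|", f"|{first_y}-1|"
    kept = [(first_x, first_y)]
    for x, y, _rank in ordered[1:]:
        x_counts, y_counts, excluded = step(x_counts, y_counts, x, y)
        if not excluded:
            kept.append((x, y))
    return kept

sample = [(1, 4, 1), (2, 4, 2), (3, 4, 3), (3, 5, 4), (3, 6, 5), (3, 7, 6)]
print(filter_rows(sample))  # [(1, 4), (2, 4), (3, 5), (3, 6)]
```

Because the state is only updated for rows that are kept, excluded rows do not use up a slot, and the sketch returns the four tuples from the question on the example data. Raising the limit above 2 would need additional branches in `bump`, which matches the scaling concern described in the answer.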