Grouping groups that share some common values

Date: 2018-01-16 22:23:35

Tags: sql sql-server

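This is a hard problem to explain, but I am trying to write a SQL query that produces a list of parent groups, where each parent group contains every group that shares a product with at least one other group in it. The groups do not all have to share a product with each other; sharing a product with any one other group in the parent group is enough to be included.

For example: because group 1 has {101,102,103} and group 5 has {101,104,105}, they are considered part of the same parent group, since they have product 101 in common. The same goes for group 4 {104}, because it has product 104 in common with group 5 (even though it shares no product IDs with group 1).

Sample data: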

group_id    product_id
1           101
1           102
1           103
2           101
3           103
4           104
5           101
5           104
5           105
6           105
6           106
6           107
7           110
7           111

Result:

parent_group_id     group_id
1                   1
1                   2
1                   3
1                   4
1                   5
1                   6
2                   7
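
Read as a graph problem, the rule above says that two groups are directly linked when they share a product, and a parent group is everything reachable through such links. The following is only a sketch of that reading in Python, using the sample rows above; the helper names `direct_links` and `reachable` are made up for illustration and are not part of any answer in this thread.

```python
from collections import defaultdict, deque

# Sample rows from the question: (group_id, product_id)
ROWS = [
    (1, 101), (1, 102), (1, 103), (2, 101), (3, 103), (4, 104),
    (5, 101), (5, 104), (5, 105), (6, 105), (6, 106), (6, 107),
    (7, 110), (7, 111),
]

def direct_links(rows):
    """Map each group to the other groups that share at least one product with it."""
    by_product = defaultdict(set)
    for group_id, product_id in rows:
        by_product[product_id].add(group_id)
    links = {group_id: set() for group_id, _ in rows}
    for groups in by_product.values():
        for g in groups:
            links[g] |= groups - {g}
    return links

def reachable(links, start):
    """Breadth-first search: every group connected to `start`, directly or indirectly."""
    seen = {start}
    queue = deque([start])
    while queue:
        current = queue.popleft()
        for neighbour in links[current] - seen:
            seen.add(neighbour)
            queue.append(neighbour)
    return seen

links = direct_links(ROWS)
print(sorted(links[4]))             # [5]: group 4 is directly linked only to group 5
print(sorted(reachable(links, 4)))  # [1, 2, 3, 4, 5, 6]: its whole parent group
print(sorted(reachable(links, 7)))  # [7]: group 7 shares nothing, so it stands alone
```

Group 4 never shares a product with group 1, yet both end up in the same set because group 5 links them, which is the behaviour the expected result describes.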

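There is no practical limit on the number of products that can be listed under a group.

I'm not sure how to approach this. Maybe recursion with a CTE?

Ideally, I'd like to be able to do this dynamically, so that all linked products are found and can be queried as one large set.

Edit:

I came up with the following solution based on Raul's answer. The change is in the bottomLevel CTE. In their solution, depending on the group_id values and how the groups are linked, a link can be "missed". For example, in the data set below, group 2 would not end up with a parent group ID of 1, because the groups that link 2 to 1 (5, 6, and 8) all have group IDs greater than 2. My solution is to use only a direct self-join on product ID. That fixes the problem, but performance is brutal on my test data set of 150K rows (I stopped it after 30 minutes). In production I can expect millions of rows.

I tried putting the bottomLevel CTE into a temp table and adding an index to it. That helped on smaller data sets, but it is still too slow on the full set.

Am I out of luck here?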
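
To make the "missed" link concrete, here is a rough Python re-expression of the logic in Raul's answer as I read it: each group gets one parent (the smallest other group sharing a product, or itself), and the recursion only follows parents while they strictly decrease. The function names are hypothetical, and this is a simulation of the idea, not the T-SQL itself.

```python
from collections import defaultdict

# Data set from the edit: (group_id, product_id)
ROWS = [
    (1, 101), (1, 102), (1, 103), (2, 110), (2, 111), (3, 103), (4, 104),
    (5, 101), (5, 104), (5, 105), (6, 105), (6, 106), (6, 107),
    (8, 106), (8, 111), (9, 201), (10, 300), (11, 300), (11, 301),
]

def min_neighbour_parent(rows):
    """One parent per group: the smallest OTHER group sharing a product, else itself."""
    groups_by_product = defaultdict(set)
    products_by_group = defaultdict(set)
    for group_id, product_id in rows:
        groups_by_product[product_id].add(group_id)
        products_by_group[group_id].add(product_id)
    parent = {}
    for group_id, products in products_by_group.items():
        others = set()
        for product_id in products:
            others |= groups_by_product[product_id] - {group_id}
        parent[group_id] = min(others) if others else group_id
    return parent

def descend(parent, group_id):
    """Follow parents only while they strictly decrease; return the smallest id reached."""
    current = parent[group_id]
    while parent[current] < current:
        current = parent[current]
    return min(current, parent[group_id])

parent = min_neighbour_parent(ROWS)
result = {g: descend(parent, g) for g in sorted(parent)}
print(result[6])  # 1: 6 -> 5 -> 1 is a strictly decreasing chain
print(result[2])  # 2: 2 -> 8 -> 2 stops, because the next hop (6) is not visible from 8
```

Groups 2 and 8 point at each other, so the descent never leaves that pair even though 8 also shares product 106 with group 6.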

CREATE TABLE #products
(
    group_id int not null,
    product_id int not null
)

INSERT INTO #products
VALUES(1, 101)
,(1, 102)
,(1, 103)
,(2, 110)
,(2, 111)
,(3, 103)
,(4, 104)
,(5, 101)
,(5, 104)
,(5, 105)
,(6, 105)
,(6, 106)
,(6, 107)
,(8,106)
,(8,111)
,(9,201)
,(10,300)
,(11,300)
,(11,301)

CREATE CLUSTERED INDEX cx_prods ON #products (group_id,product_id);

----------------------------------------------------------------

;WITH bottomLevel AS (
     SELECT DISTINCT        
        sp.group_id as parent_group_id
        ,p.group_id

    FROM 
        #products p
        inner JOIN 
        #products sp
            ON          
            sp.product_id = p.product_id

),
rc AS (
   SELECT parent_group_id
      , group_id
   FROM bottomLevel
   UNION ALL
   SELECT b.parent_group_id
      , r.group_id
   FROM rc r
   INNER JOIN bottomLevel b
   ON r.parent_group_id = b.group_id
   AND b.parent_group_id < r.parent_group_id
)


SELECT MIN(parent_group_id) as parent_group_id
, group_id
FROM rc
GROUP BY group_id
ORDER BY group_id

OPTION (MAXRECURSION 32767)

DROP TABLE #products

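2 Answers:

Answer 0 (score: 1)

I am marking Raul's answer as accepted because it helped point me in the right direction.

But for anyone who finds this later, here is what I ended up doing.

The CTE approach based on Raul's answer works, but it was too slow for my needs. I looked into the new graph features in SQL Server 2017, but they do not support transitive closure yet. No luck there. It did give me a search term, though: transitive closure clustering. I found the following two articles about doing this in SQL Server.

This article by Davide Mauri: http://sqlblog.com/blogs/davide_mauri/archive/2017/11/12/lateral-thinking-transitive-closure-clustering-with-sql-server-uda-and-json.aspx

And this one by Itzik Ben-Gan: http://www.itprotoday.com/microsoft-sql-server/t-sql-puzzle-challenge-grouping-connected-items

Both were very helpful for understanding the problem, but I went with Ben-Gan's solution 4.

It uses a while loop to unwind the connected nodes, deleting processed edges from a temporary input table as it goes. It runs very fast on small and medium-sized data sets and scales well. My test data of 1.2 million rows ran in 2 minutes.

Here is my version:

First, create a table to hold the test data: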
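
The shape of that loop can be sketched in a few lines of Python. This is only an illustration under my own assumptions: `cluster_edges` is a made-up name, the edge list is the set of (from_group_id, to_group_id) pairs with from < to derived from the test data, and the linear scan over `remaining` stands in for what the T-SQL does with indexed joins.

```python
def cluster_edges(edges):
    """Return {group_id: (parent_group_id, lvl)} by repeatedly consuming edges."""
    remaining = set(edges)   # plays the role of the temporary input table
    out = {}                 # plays the role of the output table
    lvl = 1
    while remaining:
        # The lowest remaining pair seeds a new cluster; its first id is the root.
        root, first = min(remaining)
        remaining.discard((root, first))
        out[root] = (root, lvl)
        out[first] = (root, lvl)
        frontier = {root, first}
        while frontier:
            lvl += 1
            # Edges touching the most recently added groups, in either direction.
            touched = {e for e in remaining if e[0] in frontier or e[1] in frontier}
            remaining -= touched  # processed edges are never looked at again
            found = {g for e in touched for g in e if g not in out}
            for g in found:
                out[g] = (root, lvl)
            frontier = found
    return out

# Pairs of groups sharing a product in the answer's test data (from < to).
EDGES = [(1, 3), (1, 5), (4, 5), (5, 6), (6, 8), (2, 8), (10, 11)]
clusters = cluster_edges(EDGES)
print(clusters[2][0])   # 1: group 2 is reached through 8, 6 and 5
print(clusters[11][0])  # 10
print(9 in clusters)    # False: group 9 shares no product, so it has no edge
```

Because the input is an edge list, a group that shares nothing with any other group (group 9 here) never appears in the output; if such groups need a row, they would have to be added separately.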

CREATE TABLE [dbo].[GroupsToProducts](
    [group_id] [INT] NOT NULL,
    [product_id] [INT] NOT NULL,
 CONSTRAINT [PK_GroupsToProducts] PRIMARY KEY CLUSTERED 
(
    [group_id] ASC,
    [product_id] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
) ON [PRIMARY]
GO

INSERT INTO GroupsToProducts
VALUES(1, 101)
,(1, 102)
,(1, 103)
,(2, 110)
,(2, 111)
,(3, 103)
,(4, 104)
,(5, 101)
,(5, 104)
,(5, 105)
,(6, 105)
,(6, 106)
,(6, 107)
,(8,106)
,(8,111)
,(9,201)
,(10,300)
,(11,300)
,(11,301)

Then run the script to generate the clusters.

CREATE TABLE #group_rels
(
    from_group_id int not null,
    to_group_id int not null
)

INSERT INTO #group_rels

SELECT 
    p.group_id AS from_group_id,
    sp.group_id AS to_group_id

FROM 
    GroupsToProducts p
    inner JOIN 
    GroupsToProducts sp
        ON          
        sp.product_id = p.product_id
        AND p.group_id < sp.group_id
GROUP BY 
    p.group_id,
    sp.group_id

CREATE UNIQUE CLUSTERED INDEX idx_from_group_id_to_group_id ON #group_rels(from_group_id, to_group_id);
CREATE UNIQUE NONCLUSTERED INDEX idx_to_group_id_from_group_id ON #group_rels(to_group_id, from_group_id);

-------------------------------------------------

CREATE TABLE #G
(
  group_id INT NOT NULL,
  parent_group_id INT NOT NULL,
  lvl INT NOT NULL,
  PRIMARY KEY NONCLUSTERED (group_id),
  UNIQUE CLUSTERED(lvl, group_id)
);

DECLARE @lvl AS INT = 1, @added AS INT, @from_group_id AS INT, @to_group_id AS INT;
DECLARE @CurIds AS TABLE(id INT NOT NULL);


-- gets the first relationship pair 
-- will use the from_group_id as a 'root' group
SELECT TOP (1) 
    @from_group_id = from_group_id, 
    @to_group_id = to_group_id

FROM 
    #group_rels

ORDER BY 
    from_group_id, 
    to_group_id;

SET @added = @@ROWCOUNT;


WHILE @added > 0
BEGIN

    -- inserts two rows into the output table:
    -- a self pairing using from_group_id 
    -- AND the actual relationship pair 
    INSERT INTO #G
        (group_id, parent_group_id, lvl) 
    VALUES
        (@from_group_id, @from_group_id, @lvl),
        (@to_group_id, @from_group_id, @lvl);

    -- removes the pair from input table
    DELETE FROM #group_rels 
    WHERE 
        from_group_id = @from_group_id 
        AND to_group_id = @to_group_id;

    WHILE @added > 0
    BEGIN

        -- increment the lvl variable so we only look at the most recently inserted data 
        SET @lvl += 1;

        ----------------------------------------------------------------------------

        -- the same basic chunk of code is done twice 
        --      once for group_ids in the output table that join against from_group_id and 
        --      once for group_ids in the output table that join against to_group_id

        -- 1 -  join the output table against the input table, looking for any groups that join 
        --      against groups that have already been found to (directly or indirectly) connect to the root group.
        -- 2 -  store the found group_ids in the @CurIds table variable and delete the relationship from the input table.
        -- 3 -  insert the group_ids in the output table using @from_group_id (the current root node id) as the parent group id

        -- if any rows are added to the output table in either chunk, loop and look for any groups that may connect to them.

        ------------------------------------------------------------------------------

        DELETE FROM @CurIds; 

        DELETE FROM group_rels
            OUTPUT deleted.to_group_id AS id INTO @CurIds(id)
        FROM 
            #G AS G
            INNER JOIN #group_rels AS group_rels
                ON G.group_id = group_rels.from_group_id
        WHERE 
            lvl = @lvl - 1;

        INSERT INTO #G
        (group_id, parent_group_id, lvl)
        SELECT DISTINCT 
            id, 
            @from_group_id AS parent_group_id, 
            @lvl AS lvl
        FROM 
            @CurIds AS C
        WHERE 
            NOT EXISTS
            ( 
                SELECT 
                    * 
                FROM 
                    #G AS G
                WHERE 
                    G.group_id = C.id 
            );


        SET @added = @@ROWCOUNT;

        -----------------------------------------------------------------------------------
        DELETE FROM @CurIds;

        DELETE FROM group_rels
        OUTPUT deleted.from_group_id AS id INTO @CurIds(id)
        FROM 
            #G AS G
            INNER JOIN #group_rels AS group_rels
                ON G.group_id = group_rels.to_group_id
        WHERE 
            lvl = @lvl - 1;           

        INSERT INTO #G
        (group_id, parent_group_id, lvl)
        SELECT DISTINCT 
            id, 
            @from_group_id AS parent_group_id, 
            @lvl AS lvl
        FROM 
            @CurIds AS C
        WHERE 
            NOT EXISTS
            ( 
                SELECT 
                    * 
                FROM 
                    #G AS G
                WHERE 
                    G.group_id = C.id 
            );

        SET @added += @@ROWCOUNT;

    END;

    ------------------------------------------------------------------------------
    -- At this point, no new rows were added, so the cluster should be complete.
    -- Look for another row in the input table to use as a root group

    SELECT TOP (1) 
        @from_group_id = from_group_id, 
        @to_group_id = to_group_id
    FROM 
        #group_rels
    ORDER BY 
        from_group_id, 
        to_group_id;

    SET @added = @@ROWCOUNT;
END;

SELECT 
parent_group_id,
group_id, 
lvl
FROM #G
--ORDER BY
--parent_group_id,
--group_id, 
--lvl


-------------------------------------------------
DROP TABLE #G
DROP TABLE #group_rels

Answer 1 (score: 0)

Starting from the following statements:

CREATE TABLE products
(
    group_id int not null,
    product_id int not null
)

INSERT INTO products
VALUES(1, 101)
,(1, 102)
,(1, 103)
,(2, 101)
,(3, 103)
,(4, 104)
,(5, 101)
,(5, 104)
,(5, 105)
,(6, 105)
,(6, 106)
,(6, 107)
,(7, 110)
,(7, 111)


;WITH bottomLevel AS (
      SELECT ISNULL(MIN(matchedGroup),group_id) as parent_group_id
      , group_id
      FROM products p
      OUTER APPLY (
        SELECT MIN(group_id) AS matchedGroup
        FROM products sp
        WHERE sp.group_id != p.group_id
        AND sp.product_id = p.product_id
      ) oa
      GROUP BY p.group_id
),
rc AS (
   SELECT parent_group_id
      , group_id
   FROM bottomLevel
   UNION ALL
   SELECT b.parent_group_id
      , r.group_id
   FROM rc r
   INNER JOIN bottomLevel b
   ON r.parent_group_id = b.group_id
   AND b.parent_group_id < r.parent_group_id
)
SELECT MIN(parent_group_id) as parent_group_id
, group_id
FROM rc
GROUP BY group_id
ORDER BY group_id

OPTION (MAXRECURSION 32767)

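I first group by group_id, taking the minimum group_id that has a matching product, and then recursively join each parent to its own parent as long as that parent has a smaller ID.

This solution may not cover every exception you run into in production, but it should give you somewhere to start.

Also, if you have a large products table this may run very slowly, so consider using C#, Spark, SSIS, or any other data manipulation engine for this data matching.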
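
As one possible shape for doing the matching outside the database, the sketch below uses a union-find (disjoint-set) structure in Python. This is a different technique from the recursive CTE above, not a translation of it; the function name `parent_groups` is hypothetical, and the rows are assumed to have been fetched as (group_id, product_id) pairs.

```python
def parent_groups(rows):
    """Label every group with the smallest group_id in its connected set."""
    parent = {}

    def find(x):
        # Walk to the root, shortening the path along the way (path halving).
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a == root_b:
            return
        # Keep the smaller id as the root so the label is the minimum group_id.
        if root_a < root_b:
            parent[root_b] = root_a
        else:
            parent[root_a] = root_b

    first_group_for_product = {}
    for group_id, product_id in rows:
        parent.setdefault(group_id, group_id)
        if product_id in first_group_for_product:
            # Any two groups listing the same product belong together.
            union(group_id, first_group_for_product[product_id])
        else:
            first_group_for_product[product_id] = group_id
    return {g: find(g) for g in sorted(parent)}

# Rows from the question's edit: (group_id, product_id)
ROWS = [
    (1, 101), (1, 102), (1, 103), (2, 110), (2, 111), (3, 103), (4, 104),
    (5, 101), (5, 104), (5, 105), (6, 105), (6, 106), (6, 107),
    (8, 106), (8, 111), (9, 201), (10, 300), (11, 300), (11, 301),
]
labels = parent_groups(ROWS)
for group_id, parent_group_id in labels.items():
    print(parent_group_id, group_id)
```

Each row is processed once, so the work grows roughly linearly with the number of rows, and groups that share nothing simply keep themselves as parent. The labels are the minimum group_id per set rather than the sequential 1, 2, ... numbering shown in the question, so a final renumbering step would be needed if that exact format matters.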