Treating similar values as duplicates

Time: 2019-01-30 13:49:51

Tags: sql postgresql select

I have the following tables:

    Orders 
    order_id
    9
    10
    11

    Order_details 
    order_id, product_id  
    9,        7    
    10,       5
    10,       6
    11,       6
    11,       7

    Products 
    product_id, product_name, price
    5,          potato,       4.99
    6,          potato *,     7.5
    7,          orange,       7.99

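I have already received feedback on how to find orders that contain duplicate product names, but things have become more complicated: it turns out that the duplicates carry an additional symbol, " *", after the product name, as shown above.

How can I extend this query so that it counts only orders in which one product has no extra characters and another product has them?

For example, an order with "potato" and "potato" would be ignored, as would an order with "potato *" and "potato *", but an order with "potato" and "potato *" would be included in the result.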

select od.order_id
from order_details od join
     products p
     on od.product_id = p.product_id
group by od.order_id
having count(p.product_name) > count(distinct p.product_name)
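
To make the problem concrete, here is a minimal Python sketch that recomputes the two counts from the `having` clause for each order in the sample data. The names `PRODUCTS`, `ORDER_DETAILS` and `name_counts_per_order` are hypothetical and used only for illustration; Python stands in for SQL so the numbers are easy to check by hand.

```python
from collections import defaultdict

# Sample data from the question: product_id -> product_name, and (order_id, product_id) rows.
PRODUCTS = {5: "potato", 6: "potato *", 7: "orange"}
ORDER_DETAILS = [(9, 7), (10, 5), (10, 6), (11, 6), (11, 7)]


def name_counts_per_order():
    """Return {order_id: (count(name), count(distinct name))} for the sample data."""
    names = defaultdict(list)
    for order_id, product_id in ORDER_DETAILS:
        names[order_id].append(PRODUCTS[product_id])
    return {oid: (len(ns), len(set(ns))) for oid, ns in names.items()}


if __name__ == "__main__":
    for order_id, (total, distinct) in sorted(name_counts_per_order().items()):
        print(order_id, total, distinct)
```

Because "potato" and "potato *" are different strings, both counts are equal for every order, so the query above does not flag order 10.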

2 answers:

Answer 0: (score: 1)

One option might be to simply use REPLACE to remove the " *" from the product name:

SELECT
    od.order_id
FROM order_details od
INNER JOIN products p
    ON od.product_id = p.product_id
GROUP BY
    od.order_id
HAVING
    COUNT(DISTINCT p.product_name) <>
    COUNT(DISTINCT REPLACE(p.product_name, ' *', ''));

Demo

The demo is for MySQL, but the same query should run on at least several other databases.
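
As a quick check of the REPLACE-based query against the question's sample data, the sketch below loads the rows into an in-memory SQLite database through Python's `sqlite3` module. SQLite is an assumption chosen only because it needs no server; the helper names `build_sample_db` and `orders_with_starred_duplicates` are hypothetical, and an `ORDER BY` is added to make the output order deterministic.

```python
import sqlite3


def build_sample_db():
    """Create an in-memory database holding the question's sample rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE products (
            product_id INTEGER PRIMARY KEY, product_name TEXT, price REAL);
        CREATE TABLE order_details (order_id INTEGER, product_id INTEGER);
        INSERT INTO products VALUES
            (5, 'potato', 4.99), (6, 'potato *', 7.5), (7, 'orange', 7.99);
        INSERT INTO order_details VALUES
            (9, 7), (10, 5), (10, 6), (11, 6), (11, 7);
        """
    )
    return conn


def orders_with_starred_duplicates(conn):
    """Run the REPLACE-based query and return the matching order ids."""
    rows = conn.execute(
        """
        SELECT od.order_id
        FROM order_details od
        INNER JOIN products p ON od.product_id = p.product_id
        GROUP BY od.order_id
        HAVING COUNT(DISTINCT p.product_name) <>
               COUNT(DISTINCT REPLACE(p.product_name, ' *', ''))
        ORDER BY od.order_id
        """
    ).fetchall()
    return [row[0] for row in rows]


if __name__ == "__main__":
    print(orders_with_starred_duplicates(build_sample_db()))
```

Only order 10 contains both "potato" and "potato *", so it is the only order for which the two distinct counts differ.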

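Ideally, it would be better to do a regex replacement on the product name, which avoids removing a space followed by * when it occurs as a legitimate part of the product name.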

Edit:

Since you are using Postgres, we can actually do a more targeted regex replacement:

SELECT
    od.order_id
FROM order_details od
INNER JOIN products p
    ON od.product_id = p.product_id
GROUP BY
    od.order_id
HAVING
    COUNT(DISTINCT p.product_name) <>
    COUNT(DISTINCT REGEXP_REPLACE(p.product_name, ' \*$', ''));

Demo
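
The difference between the two replacements can be illustrated outside the database. The following Python sketch mirrors `REPLACE(name, ' *', '')` with `str.replace` and the anchored pattern with `re.sub`; the function names and the product name "kiwi * green" are made up for illustration and do not appear in the question's data.

```python
import re


def strip_plain(name):
    """Mirror REPLACE(name, ' *', ''): removes every ' *' anywhere in the name."""
    return name.replace(" *", "")


def strip_anchored(name):
    """Mirror the anchored regex ' \\*$': removes ' *' only at the end of the name."""
    return re.sub(r" \*$", "", name)


if __name__ == "__main__":
    for name in ("potato *", "kiwi * green"):
        print(repr(name), repr(strip_plain(name)), repr(strip_anchored(name)))
```

Both functions turn "potato *" into "potato", but only the plain replacement alters a name in which " *" appears in the middle.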

Answer 1: (score: 0)

You can match on the longest initial substring:


CREATE TABLE products (
        product_id INTEGER NOT NULL PRIMARY KEY
        , product_name text
        , price DECIMAL(8,2)
        );

INSERT  INTO products(product_id, product_name, price) VALUES
    (5,          'potato',       4.99)
    ,(6,          'potato *',     7.5)
    ,(1,          'potatoes',     7.48) -- added these
    ,(2,          'potatoe',     7.49)  --
    ,(7,          'orange',       7.99)
        ;

ALTER TABLE products
        ADD COLUMN parent_id INTEGER REFERENCES products(product_id)
        , ADD COLUMN canonical_id INTEGER REFERENCES products(product_id);

UPDATE products
SET canonical_id = product_id;

SELECT * FROM products;

WITH xxx AS  ( select product_id, product_name
        , length(product_name) AS len
        FROM products
        )
UPDATE products dst
SET parent_id = src.product_id
FROM xxx src
-- WHERE position (src.product_name IN dst.product_name) = 1
WHERE dst.product_name LIKE src.product_name ||'%'::text
AND src.len > 4
AND src.len < length(dst.product_name)
 AND NOT EXISTS (
        SELECT * FROM xxx nx
        WHERE dst.product_name LIKE nx.product_name|| '%'::text
        AND nx.len < length(dst.product_name)
        AND nx.len > src.len
        AND nx.product_id <> dst.product_id
        )
        ;

SELECT * FROM products;

WITH yyy AS  ( select product_id, product_name
        , length(product_name) AS len
        FROM products
        )
UPDATE products dst
SET canonical_id = src.product_id
FROM yyy src
WHERE dst.product_name LIKE src.product_name ||'%'::text
AND src.len > 4
AND src.len < length(dst.product_name)
 AND NOT EXISTS (
        SELECT * FROM yyy nx
        WHERE dst.product_name LIKE nx.product_name|| '%'::text
        AND nx.len < src.len
        )
        ;

SELECT * FROM products;

Result:


DROP SCHEMA
CREATE SCHEMA
SET
CREATE TABLE
INSERT 0 5
ALTER TABLE
UPDATE 5
 product_id | product_name | price | parent_id | canonical_id 
------------+--------------+-------+-----------+--------------
          5 | potato       |  4.99 |           |            5
          6 | potato *     |  7.50 |           |            6
          1 | potatoes     |  7.48 |           |            1
          2 | potatoe      |  7.49 |           |            2
          7 | orange       |  7.99 |           |            7
(5 rows)

UPDATE 3
 product_id | product_name | price | parent_id | canonical_id 
------------+--------------+-------+-----------+--------------
          5 | potato       |  4.99 |           |            5
          7 | orange       |  7.99 |           |            7
          6 | potato *     |  7.50 |         5 |            6
          2 | potatoe      |  7.49 |         5 |            2
          1 | potatoes     |  7.48 |         2 |            1
(5 rows)

UPDATE 3
 product_id | product_name | price | parent_id | canonical_id 
------------+--------------+-------+-----------+--------------
          5 | potato       |  4.99 |           |            5
          7 | orange       |  7.99 |           |            7
          6 | potato *     |  7.50 |         5 |            5
          2 | potatoe      |  7.49 |         5 |            5
          1 | potatoes     |  7.48 |         2 |            5
(5 rows)

Note: this may need some additional heuristic tuning (or even manual editing).
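
The logic of the two UPDATE statements can be sketched in Python as follows. This is a rough illustration under stated assumptions, not a replacement for the SQL: product names are assumed to be unique, prefix matching is literal (`str.startswith`, closer to the commented-out `position(...) = 1` variant than to `LIKE`, which treats `%` and `_` as wildcards), and the names `MIN_PREFIX_LEN` and `assign_parents_and_canonicals` are hypothetical.

```python
MIN_PREFIX_LEN = 4  # the SQL only accepts a prefix when src.len > 4


def assign_parents_and_canonicals(products):
    """products maps product_id -> product_name (names assumed unique).

    Returns (parent, canonical): parent holds the longest strictly shorter
    prefix for rows that have one, canonical holds the shortest such prefix
    and defaults to the row's own id.
    """
    parent = {}
    canonical = {pid: pid for pid in products}
    for dst_id, dst_name in products.items():
        # Every other name that is a strictly shorter literal prefix of dst_name.
        prefixes = [
            (len(name), pid)
            for pid, name in products.items()
            if len(name) < len(dst_name) and dst_name.startswith(name)
        ]
        if not prefixes:
            continue
        longest_len, longest_id = max(prefixes)
        shortest_len, shortest_id = min(prefixes)
        if longest_len > MIN_PREFIX_LEN:
            parent[dst_id] = longest_id
        if shortest_len > MIN_PREFIX_LEN:
            canonical[dst_id] = shortest_id
    return parent, canonical


if __name__ == "__main__":
    sample = {5: "potato", 6: "potato *", 1: "potatoes", 2: "potatoe", 7: "orange"}
    print(assign_parents_and_canonicals(sample))
```

On the five sample rows this gives the same `parent_id` and `canonical_id` values as the final result table above. The length threshold is one of the heuristics that may need tuning for real data.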