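I need to find a way to determine the most common substring in a PostgreSQL array.
I have a one-dimensional array in a PostgreSQL column that stores CPV values (a nested classification vocabulary, https://simap.ted.europa.eu/cpv). The codes consist of numeric characters, but because some records have leading zeros they are stored as varchar, like this: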
["45331110", "50721000", "45251250", "42160000", "39715000", "45315000", "09323000", "71321200", "45331100", "50720000"]
I want to use PostgreSQL to extract the most common leading two digits from this array, which in the example is 45.
Answer 0: (score: 2)
If you want the most common leading two digits for each row, you can use:
WITH data_rows(id, cpv_values) AS (
VALUES (1, ARRAY ['45331110', '50721000', '45251250','42160000','39715000','45315000', '09323000','71321200','45331100', '50720000'])
, (2, ARRAY ['50721000']) -- second test case
)
SELECT id, leading_two_digits
FROM data_rows
-- for every row in `data_rows` (your table),
-- select the most common `leading_two_digits` (through GROUP BY/ORDER BY/LIMIT 1)
JOIN LATERAL (
SELECT left(code, 2) AS leading_two_digits
FROM unnest(cpv_values) AS f(code)
GROUP BY left(code, 2)
ORDER BY COUNT(*) DESC
LIMIT 1
) s ON true
This returns:
+--+------------------+
|id|leading_two_digits|
+--+------------------+
|1 |45 |
|2 |50 |
+--+------------------+
If you want the most common leading two digits across all rows, you can use:
WITH data_rows(cpv_values) AS (
VALUES (ARRAY ['45331110', '50721000', '45251250','42160000','39715000','45315000', '09323000','71321200','45331100', '50720000']),
(ARRAY ['45'])
)
SELECT left(code, 2) AS leading_two_digits
FROM data_rows, unnest(cpv_values) AS f(code)
GROUP BY left(code, 2)
ORDER BY COUNT(*) DESC
LIMIT 1
Answer 1: (score: 1)
This query does what you need.
select substr(t, 1, 2) mc
from unnest(array['45331110', '50721000', '45251250', '42160000', '39715000', '45315000', '09323000', '71321200', '45331100', '50720000']) t
group by mc
order by count(1) desc
limit 1;
Result:
Name|Value|
----|-----|
mc |45 |
You can use the query above as a subquery to extract the most common substring for each row.
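A minimal sketch of that per-row use, written as a correlated scalar subquery. The table name `tenders` and its columns are hypothetical and are stood in for by a CTE so the statement runs on its own. The second `ORDER BY` key is an addition that is not in the query above: it only makes the result deterministic when two prefixes are equally common.

```sql
-- Hypothetical table tenders(id, cpv_values varchar[]), simulated with a CTE.
WITH tenders(id, cpv_values) AS (
    VALUES (1, ARRAY['45331110', '50721000', '45251250', '42160000', '39715000',
                     '45315000', '09323000', '71321200', '45331100', '50720000']),
           (2, ARRAY['50721000'])
)
SELECT id,
       -- Correlated subquery: evaluated once per row of tenders.
       (SELECT substr(t, 1, 2)
        FROM unnest(cpv_values) AS t
        GROUP BY substr(t, 1, 2)
        -- Most frequent prefix first; ties broken by the smaller prefix (assumption).
        ORDER BY count(*) DESC, substr(t, 1, 2)
        LIMIT 1) AS mc
FROM tenders;
-- id 1 -> 45, id 2 -> 50
```

For a row whose array is empty or NULL, the subquery yields no row and `mc` comes back as NULL, whereas the `JOIN LATERAL ... ON true` form in the other answer would drop that row entirely.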