Abbreviation of unique strings

Time: 2018-10-06 07:33:50

Tags: sql r oracle distinct-values

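I have a list of unique strings (the initial idea was the column names in a table). The task is to abbreviate the list as much as possible while keeping the entries distinct.

For example, AAA, AB can be abbreviated to AA, AB (but not to A, AB, because A is a prefix of both AAA and AB). AAAA, BAAAA can be shortened to A, B. But A1, A2 cannot be abbreviated at all.

Here is the sample data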

create table tab as 
select 'AAA' col from dual union all
select 'AABA' col from dual union all
select 'COL1' col from dual union all
select 'COL21' col from dual union all
select 'AAAAAA' col from dual union all
select 'BBAA' col from dual union all
select 'BAAAA' col from dual union all
select 'AB' col from dual;

The expected result is

COL    ABR_COL                
------ ------------------------
AAA    AAA                      
AAAAAA AAAA                     
AABA   AAB                      
AB     AB                       
BAAAA  BA                       
BBAA   BB                       
COL1   COL1                     
COL21  COL2        
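
As a cross-check of the expected result, the rule can be sketched outside the database. This is a minimal Python sketch, assuming the rule is "the shortest prefix that no other string starts with, falling back to the full string". The function name `abbreviate` and the variable `sample` are illustrative only and not part of the question.

```python
def abbreviate(names):
    """Map each name to its shortest prefix that no other name starts with.

    Assumption: when every prefix (including the full name) is also a
    prefix of another name, the name is kept unabbreviated.
    """
    result = {}
    for name in names:
        others = [n for n in names if n != name]
        result[name] = name  # fallback: keep the full name
        for length in range(1, len(name) + 1):
            prefix = name[:length]
            if not any(o.startswith(prefix) for o in others):
                result[name] = prefix
                break
    return result


# Sample data from the question
sample = ["AAA", "AABA", "COL1", "COL21", "AAAAAA", "BBAA", "BAAAA", "AB"]
for name, abbr in sorted(abbreviate(sample).items()):
    print(name, abbr)
```

This compares every name with every other name, which is fine for a list of column names but not meant for large inputs.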

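I managed a brute-force solution consisting of four subqueries. I am deliberately not posting it, because I hope a simpler solution exists and I do not want to distract from it.

There is a similar function in R called abbreviate, but I am looking for a SQL solution. An Oracle solution is preferred; solutions for other RDBMS are welcome.

3 answers:

Answer 0 (score: 3)

This is actually doable with a recursive CTE. I did not really make it shorter than three subqueries (plus one query), but at least it is not limited by the string length. The steps are roughly as follows:

  1. Compute all possible abbreviations with a recursive CTE. This selects every column name itself and then, recursively, the column name shortened by one letter:

Table: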

 col    abbr
 --- -------
 AAA    AAA
 AAA    AA
 AAA    A
 ...
  2. For each abbreviation, count how often it occurs:

ABBR    CONFLICT
----    --------
AA      3
AAA     2
AABA    1
...
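  3. Select the abbreviations that are unique, plus the abbreviation that is just the column name itself, and rank them by the length of the abbreviation. In the example you can see that AAA conflicts with another abbreviation, but it must still be selected because it equals its unabbreviated name.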

COL     ABBR    CONFLICT    POS
-------------------------------
AAA     AAA     2           1
AAAAAA  AAAA    1           1
AAAAAA  AAAAA   1           2
AAAAAA  AAAAAA  1           3
AABA    AAB     1           1
...
  4. For each column, select the top-ranked abbreviation (or the column name itself).

COL     ABBR    POS
-------------------
AAA     AAA     1
AAAAAA  AAAA    1
AABA    AAB     1
...
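
The four steps above can also be followed outside SQL. The following is a hedged Python sketch of the same idea, not the answer's code; `all_prefixes`, `abbreviate_by_counting` and `names` are hypothetical names chosen for this illustration.

```python
from collections import Counter


def all_prefixes(name):
    """Step 1: the name itself plus every shorter non-empty prefix."""
    return [name[:n] for n in range(len(name), 0, -1)]


def abbreviate_by_counting(names):
    # Step 1: (col, abbr) pairs for every potential abbreviation.
    pairs = [(col, abbr) for col in names for abbr in all_prefixes(col)]
    # Step 2: how many names produce each abbreviation.
    conflict = Counter(abbr for _, abbr in pairs)
    # Step 3: keep unique abbreviations and the unabbreviated name itself.
    candidates = {}
    for col, abbr in pairs:
        if conflict[abbr] == 1 or abbr == col:
            candidates.setdefault(col, []).append(abbr)
    # Step 4: the shortest surviving candidate wins.
    return {col: min(abbrs, key=len) for col, abbrs in candidates.items()}


# Sample data from the question
names = ["AAA", "AABA", "COL1", "COL21", "AAAAAA", "BBAA", "BAAAA", "AB"]
for col, abbr in sorted(abbreviate_by_counting(names).items()):
    print(col, abbr)
```

The sketch ranks candidates by length directly, whereas the SQL below orders by the abbreviation text; for prefixes of the same name the two orders coincide.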

Full SQL

This results in the following SQL, with the steps above as CTEs:

-- Step 1: every column name plus all of its shorter prefixes
with potential_abbreviations(col, abbr) as (
  select
      col
    , col as abbr
  from tab
  union all
  select
      col
    , substr(abbr, 1, length(abbr) - 1) as abbr
  from potential_abbreviations
  where length(abbr) > 1
)
-- Step 2: how many column names share each abbreviation
, abbreviation_counts as (
  select abbr
       , count(*) as conflict
  from potential_abbreviations
  group by abbr
)
-- Step 3: keep unique abbreviations plus the full name; prefixes of the
-- same name sort shortest first, so ordering by abbr ranks by length
, all_unique_abbreviations(col, abbr, conflict, pos) as (
  select
      p.col
    , p.abbr
    , c.conflict
    , rank() over (partition by p.col order by p.abbr) as pos
  from potential_abbreviations p
    join abbreviation_counts c on p.abbr = c.abbr
  where c.conflict = 1 or p.col = p.abbr
)
-- Step 4: the top-ranked abbreviation per column
select col, abbr, pos
from all_unique_abbreviations
where pos = 1
order by col, abbr

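Result (the rows AC1 and AD are not part of the question's sample data; they apparently come from extra test rows)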

COL     ABBR
------- ----
AAA     AAA
AAAAAA  AAAA
AABA    AAB
AB      AB
AC1     AC
AD      AD
BAAAA   BA
BBAA    BB
COL1    COL1
COL21   COL2

SQL Fiddle

Answer 1 (score: 3)

I would do the filtering in the recursive CTE:

with potential_abbreviations(col, abbr, lev) as (
      select col, col as abbr, 1 as lev
      from tab
      union all
      select pa.col, substr(pa.abbr, 1, length(pa.abbr) - 1) as abbr, lev + 1
      from potential_abbreviations pa
      where length(abbr) > 1 and
            not exists (select 1
                        from tab
                        where tab.col like substr(pa.abbr, 1, length(pa.abbr) - 1) || '%' and
                              tab.col <> pa.col
                       )
     )
select pa.col, pa.abbr
from (select pa.*, row_number() over (partition by pa.col order by pa.lev desc) as seqnum
      from potential_abbreviations pa
     ) pa
where seqnum = 1

Here is a db<>fiddle.

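Strictly speaking, lev is not needed. You could use length(abbr) in the order by instead (ascending, because the highest lev corresponds to the shortest abbreviation). However, when I use recursive CTEs I usually include a recursion counter, so this is habit.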

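Doing the extra comparison in the CTE may look more complicated, but it simplifies execution: the recursion stops at the correct value.

This was also tested on a unique single-letter col value.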
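
The early stop can be illustrated with a short Python sketch. It assumes the same rule as the query (drop the last letter as long as no other name starts with the shorter prefix); `abbreviate_early_stop` and `names` are hypothetical names for this illustration, and `str.startswith` stands in for the `like ... || '%'` test.

```python
def abbreviate_early_stop(names):
    """Shorten each name letter by letter and stop at the first conflict."""
    result = {}
    for col in names:
        abbr = col
        # Mirrors the recursive member: recurse only while the shorter
        # prefix does not match the start of any other name.
        while len(abbr) > 1:
            shorter = abbr[:-1]
            if any(o != col and o.startswith(shorter) for o in names):
                break
            abbr = shorter
        result[col] = abbr
    return result


# Sample data from the question
names = ["AAA", "AABA", "COL1", "COL21", "AAAAAA", "BBAA", "BAAAA", "AB"]
for col, abbr in sorted(abbreviate_early_stop(names).items()):
    print(col, abbr)
```

One difference to keep in mind: in SQL, `like` treats `%` and `_` inside the value as wildcards, so column names containing underscores may need escaping there, while `startswith` compares literally.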

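Answer 2 (score: 1)

I found a second approach, which I did not add to my first answer because it is shorter and different. The steps are as follows:

  1. Recursively compute all potential abbreviations for each name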

SQL

  select
      col
    , col as abbr
  from tab
  union all
  select
    col
  , substr(abbr, 1, length(abbr)-1 ) as abbr
  from potential_abbreviations a
  where length(abbr) > 1

Result

 col    abbr
 --- -------
 AAA    AAA
 AAA    AA
 AAA    A
 ...
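  2. Then count the conflicts between the abbreviations, and keep track of the column name that produced each abbreviation. We only want to keep abbreviations that cause no conflict, so the min() aggregate is irrelevant: each remaining group contains exactly one row, and min(col) simply returns its column name.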

SQL

select
    abbr
  , count(*) as conflicts
  , min(col) as best_candidate
  from potential_abbreviations
 group by abbr
having count(*) = 1

Result

ABBR    CONFLICTS BEST_CANDIDATE
------- --------- ---------------
AAAA    1         AAAAAA
AAAAA   1         AAAAAA
AAAAAA  1         AAAAAA
AAB     1         AABA
AABA    1         AABA
...
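  3. Finally, left join the potential abbreviations with the best conflict-free candidates, and use just the column name if there is no conflict-free solution: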

SQL

select
    p.col as col
  , nvl(min(c.abbr), p.col) as abbr
  from potential_abbreviations p
  left join conflict_free c on p.col = c.best_candidate
 where c.conflicts = 1 or p.abbr = p.col
 group by p.col
  order by col, abbr

Full SQL

with potential_abbreviations(col,abbr) as (
  select
      col
    , col as abbr
  from tab
  union all
  select
    col
  , substr(abbr, 1, length(abbr)-1 ) as abbr
  from potential_abbreviations a
 where length(abbr) > 1
)
, conflict_free as (
    select
        abbr
      , count(*) as conflicts
      , min(col) as best_candidate
      from potential_abbreviations
     group by abbr
    having count(*) = 1
)
select
    p.col as col
  -- , c.best_candidate
  , nvl(min(c.abbr), p.col) as abbr
  -- , min(c.abbr) over (partition by c.best_candidate) shortest
  from potential_abbreviations p
  left join conflict_free c on p.col = c.best_candidate
 where c.conflicts = 1 or p.abbr = p.col
 group by p.col, c.best_candidate
 order by col, abbr

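Result (the rows AC1 and AD are not part of the question's sample data; they apparently come from extra test rows)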

COL     ABBR
------- ----
AAA     AAA
AAAAAA  AAAA
AABA    AAB
AB      AB
AC1     AC
AD      AD
BAAAA   BA
BBAA    BB
COL1    COL1
COL21   COL2

SQL Fiddle

Note: For PostgreSQL the recursive CTE must be written as with recursive, whereas Oracle does not accept the word recursive at all.
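
To try the approach without an Oracle instance, here is a hedged sketch that runs a port of the query on SQLite through Python's `sqlite3` module. The port is an adaptation for this illustration, not the query above: it uses `with recursive`, replaces `nvl` with `coalesce`, and simplifies the final select by joining `tab` directly. The names `QUERY`, `run_on_sqlite` and `rows` are hypothetical.

```python
import sqlite3

# SQLite port of the conflict-free approach (adapted, see lead-in).
QUERY = """
with recursive potential_abbreviations(col, abbr) as (
  select col, col from tab
  union all
  select col, substr(abbr, 1, length(abbr) - 1)
  from potential_abbreviations
  where length(abbr) > 1
)
, conflict_free as (
  select abbr, min(col) as best_candidate
  from potential_abbreviations
  group by abbr
  having count(*) = 1
)
select t.col, coalesce(min(c.abbr), t.col) as abbr
from tab t
left join conflict_free c on c.best_candidate = t.col
group by t.col
order by t.col
"""


def run_on_sqlite(names):
    """Load the names into an in-memory table and run the ported query."""
    con = sqlite3.connect(":memory:")
    try:
        con.execute("create table tab (col text)")
        con.executemany("insert into tab (col) values (?)",
                        [(n,) for n in names])
        return con.execute(QUERY).fetchall()
    finally:
        con.close()


# Sample data from the question
rows = run_on_sqlite(
    ["AAA", "AABA", "COL1", "COL21", "AAAAAA", "BBAA", "BAAAA", "AB"]
)
for col, abbr in rows:
    print(col, abbr)
```

Taking `min(c.abbr)` picks the shortest conflict-free abbreviation because all candidates for one name are prefixes of that name, and a shorter prefix sorts before a longer one.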