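I am trying to work out how to compare two columns in Oracle SQL, each made up of string phrases that may appear in a different order. Two columns count as duplicates if they contain the same phrases, even when the order of the phrases differs. For example, given the columns below (Table1.column1 and Table1.column2), I want to generate the Duplicate? column.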
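Table1.column1         Table1.column2         Duplicate?
=====================================================================
ABC DEF                DEF ABC                Y
ABC DEF                GHI ABC                N
ABCD EFGH IJKL MNOP    IJKL MNOP ABCD EFGH    Y
ABCD EFGH IJKL MNOP    IJKL QRST EFGH ABCD    N
ABC ABC DEF            DEF ABC DEF            N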
I have done some research and I think I have to use the LIKE operator or REGEXP_LIKE, but I have not been able to form a concrete idea of how to solve this.
Any help would be greatly appreciated!
Answer 0: (score: 1)
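Regex splitting works neatly on a single string. The snag is that the usual approach produces a Cartesian product across multiple rows, i.e. when it is used on a table. My query borrows a clever solution from Alex Nuitjen.
To break it down: the first two subqueries tokenise the columns, the third subquery re-aggregates the tokens in alphabetical order, and the main query compares the results to flag duplicates: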
with col1 as (
    -- one row per (id, token) for COL1
    select id, col1, regexp_substr(col1, '[^ ]+', 1, rn) as tkn
    from t42
         cross join (select rownum rn
                     from (select max(regexp_count(col1, ' ') + 1) + 1 mx from t42)
                     connect by level <= mx
                    )
    where regexp_substr(col1, '[^ ]+', 1, rn) is not null
    order by id
)
, col2 as (
    -- one row per (id, token) for COL2
    select id, col2, regexp_substr(col2, '[^ ]+', 1, rn) as tkn
    from t42
         cross join (select rownum rn
                     from (select max(regexp_count(col2, ' ') + 1) + 1 mx from t42)
                     connect by level <= mx
                    )
    where regexp_substr(col2, '[^ ]+', 1, rn) is not null
    order by id
)
, ccat as (
    -- Aggregate each side on its own before joining. Joining the token rows
    -- directly would repeat every token once per token on the other side,
    -- so 'A A' and 'A' would wrongly compare as equal.
    select c1.id
           , c1.col1
           , c1.catcol1
           , c2.col2
           , c2.catcol2
    from ( select id, col1
                  , listagg(tkn, ' ') within group (order by tkn) as catcol1
           from col1
           group by id, col1 ) c1
    join ( select id, col2
                  , listagg(tkn, ' ') within group (order by tkn) as catcol2
           from col2
           group by id, col2 ) c2
      on c1.id = c2.id )
select ccat.id
       , ccat.col1
       , ccat.col2
       , case when ccat.catcol1 = ccat.catcol2 then 'Y' else 'N' end as duplicate
from ccat
order by ccat.id
/
I assume you have a key column (ID in my code).
Although this solution is more verbose than the one proposed by @shaileshyadav, it has the advantage of scaling to any number of tokens. Given this test data ...
SQL> select * from t42
  2  /

        ID COL1                    COL2
---------- ----------------------- -----------------------
         1 ABC DEF                 DEF ABC
         2 ABC DEF                 GHI ABC
         3 ABCD EFGH IJKL MNOP     IJKL MNOP ABCD EFGH
         4 ABCD EFGH IJKL MNOP     IJKL QRST EFGH ABCD
         5 ABC ABC DEF             DEF ABC DEF
         6 AAA BBB CCC DDD EEE     AAA BBB CCC DDD
         7 AAA BBB CCC DDD EEE     AAA BBB CCC DDD EEE
         8 XXX YYYY ZZZ AAA BBB    AAA BBB XXX ZZZ YYYY
         9 A B C D E F G H I J K L L K J I H G F E D C B A
        10 AA BB CC DD EE          AA BB CC DD FF

10 rows selected.

SQL>
... the query returns every row with a DUPLICATE flag of 'Y' or 'N'.
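The tokenise, sort, re-aggregate and compare steps described in this answer can be checked outside the database. The following is a minimal Python sketch of that logic only, not of the Oracle query itself; the helper names `normalise` and `is_duplicate` are made up for illustration, and the rows are the five examples from the question.

```python
def normalise(phrase: str) -> str:
    """Split on whitespace, sort the tokens, and re-join them with single spaces."""
    return " ".join(sorted(phrase.split()))


def is_duplicate(col1: str, col2: str) -> str:
    """Return 'Y' when both phrases hold the same tokens (repeats included), else 'N'."""
    return "Y" if normalise(col1) == normalise(col2) else "N"


# Sample rows taken from the question's table.
rows = [
    ("ABC DEF", "DEF ABC"),
    ("ABC DEF", "GHI ABC"),
    ("ABCD EFGH IJKL MNOP", "IJKL MNOP ABCD EFGH"),
    ("ABCD EFGH IJKL MNOP", "IJKL QRST EFGH ABCD"),
    ("ABC ABC DEF", "DEF ABC DEF"),
]

flags = [is_duplicate(c1, c2) for c1, c2 in rows]
for (c1, c2), flag in zip(rows, flags):
    print(f"{c1:<22}{c2:<22}{flag}")
```

Because the tokens are sorted rather than de-duplicated, repeated tokens still count: 'ABC ABC DEF' and 'DEF ABC DEF' normalise to different strings and are flagged 'N', which matches the question's expected result.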
Answer 1: (score: 1)
You can use LIKE together with a regex, as follows, to get the desired output:
select dummy_table.column1, dummy_table.column2,
       (case when dummy_table.column1 like ('%' || dummy_table.a || '%')
              and dummy_table.column1 like ('%' || dummy_table.b || '%')
              and dummy_table.column1 like ('%' || dummy_table.c || '%')
              and dummy_table.column1 like ('%' || dummy_table.d || '%')
              -- equal lengths guard against extra tokens in either column
              and length(dummy_table.column1) = length(dummy_table.column2)
             then 'Y' else 'N' end) as Duplicate
from
  (-- extract up to four space-separated tokens from column2
   select column1, column2,
          regexp_substr(column2, '[^ ]+', 1, 1) as a,
          regexp_substr(column2, '[^ ]+', 1, 2) as b,
          regexp_substr(column2, '[^ ]+', 1, 3) as c,
          regexp_substr(column2, '[^ ]+', 1, 4) as d
   from Table1) dummy_table;
Required result:
Table1.column1         Table1.column2         Duplicate?
=====================================================================
ABC DEF                DEF ABC                Y
ABC DEF                GHI ABC                N
ABCD EFGH IJKL MNOP    IJKL MNOP ABCD EFGH    Y
ABCD EFGH IJKL MNOP    IJKL QRST EFGH ABCD    N
ABC ABC DEF            DEF ABC DEF            N
Note: the number of 'regexp_substr' calls and 'like' conditions depends on the maximum number of phrases in the table's values.
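To see what the note means in practice, here is a rough Python imitation of the LIKE-based test. It is a sketch under assumptions: Python's substring test stands in for LIKE '%token%', the function name `like_check` is hypothetical, and `max_tokens=4` mirrors the four regexp_substr calls in the query.

```python
def like_check(col1: str, col2: str, max_tokens: int = 4) -> str:
    """Imitate the LIKE-based test: each of the first max_tokens tokens of col2
    must occur somewhere in col1, and both strings must have the same length."""
    tokens = col2.split()[:max_tokens]
    contained = all(tok in col1 for tok in tokens)
    return "Y" if contained and len(col1) == len(col2) else "N"


# Pairs with up to four tokens and no repeated tokens behave as intended.
two_tokens = like_check("ABC DEF", "DEF ABC")
missing_token = like_check("ABC DEF", "GHI ABC")

# A fifth token is never inspected, so a difference there goes unnoticed
# until a fifth regexp_substr / LIKE pair is added to the query.
five_tokens = like_check("AA BB CC DD EE", "AA BB CC DD FF")

# Repeated tokens are checked for presence only, not counted.
repeated = like_check("ABC ABC DEF", "DEF ABC DEF")

print(two_tokens, missing_token, five_tokens, repeated)
```

Under these assumptions the last two pairs come back as 'Y' even though their tokens differ, so this approach seems best suited to data with a known, small maximum number of distinct phrases per value.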