我使用Oracle11g
,我会比较两个表,找出它们之间匹配的记录。
示例:
Table 1 Table 2
George Micheal
Michael Paul
记录" Micheal"和"迈克尔"他们之间的匹配,所以他们是很好的记录。
要查看两条记录是否匹配,我使用Oracle
函数utl_match.edit_distance_similarity
。
我尝试使用下面的代码,但是我遇到了性能问题(它太慢了):
SELECT *
FROM table1
JOIN table2
ON utl_match.edit_distance_similarity(table1.name, table2.name) > 75;
有更好的解决方案吗?
谢谢
答案 0 :(得分:1)
这是一个难题。通常,它会导致嵌套循环连接和缓慢。有可能使用SOUNDEX()
来获得"关闭"匹配,然后匹配最终过滤的字符距离函数。这可能对您的问题不起作用,但可能会这样。
虽然我不是该功能的忠实粉丝,但您可能会发现soundex()
适用于您的目的(请参阅here)。
想法是在这个值上添加一个索引:
create index idx_table1_soundexname on table1(soundex(name));
create index idx_table2_soundexname on table2(soundex(name));
然后你会将其查询为:
SELECT *
FROM table1 t1 JOIN
table2 t2
ON soundex(t1.name) = soundex(t2.name)
WHERE utl_match.edit_distance_similarity(t1.name, t2.name) > 75;
这个想法是Oracle将使用索引来获取" close"然后编辑距离以获得更好的匹配。这可能不适用于您的问题。这只是一个可能有用的想法。
答案 1 :(得分:1)
如果您在表table1和table2中的名称值有很多冗余,这可能是一个解决方案
-- Test data set
select count(*) from table1;
--> 10.000
select count(*) from table2;
--> 10.000
select count(distinct(name)) from table1;
--> ~ 2500
select count(distinct(name)) from table2;
--> ~ 2500
/* a) Join with function compare */
select table1.name, table2.name
from table1, table2
where utl_match.edit_distance_similarity(table1.name, table2.name) > 35
/*
--------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost | Time |
--------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 5000000 | 270000000 | 37364 | 00:09:21 |
| 1 | NESTED LOOPS | | 5000000 | 270000000 | 37364 | 00:09:21 |
| 2 | TABLE ACCESS FULL | TABLE1 | 10000 | 270000 | 5 | 00:00:01 |
| * 3 | TABLE ACCESS FULL | TABLE2 | 500 | 13500 | 4 | 00:00:01 |
--------------------------------------------------------------------------------
Predicate Information (identified by operation id):
------------------------------------------
* 3 - filter("UTL_MATCH"."EDIT_DISTANCE_SIMILARITY"("TABLE1"."NAME","TABLE2"."NAME")>35)
Note
-----
- dynamic sampling used for this statement
*/
/* b) Join with function, only distinct values */
-- A Set of all existing names (in table1 and table2)
with names as
(select name from table1 union select name from table2),
-- Compare only once because utl_match.edit_distance_similarity(name1, name2) = utl_match.edit_distance_similarity(name2, name1)
table_cmp(name1, name2) as
(select n1.name, n2.name
from names n1
join names n2
on n1.name <= n2.name
and utl_match.edit_distance_similarity(n1.name, n2.name) > 35)
select t1.*, t2.*
from table_cmp c
join table1 t1
on t1.name = c.name1
join table2 t2
on t2.name = c.name2
union all
select t1.*, t2.*
from table_cmp c
join table1 t1
on t1.name = c.name2
join table2 t2
on t2.name = c.name1;
/*
--------------------------------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost | Time |
--------------------------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 30469950 | 3290754600 | 2495 | 00:00:38 |
| 1 | TEMP TABLE TRANSFORMATION | | | | | |
| 2 | LOAD AS SELECT | SYS_TEMP_0FD9D663E_B39FC2B6 | | | | |
| 3 | SORT UNIQUE | | 20000 | 540000 | 12 | 00:00:01 |
| 4 | UNION-ALL | | | | | |
| 5 | TABLE ACCESS FULL | TABLE1 | 10000 | 270000 | 5 | 00:00:01 |
| 6 | TABLE ACCESS FULL | TABLE2 | 10000 | 270000 | 5 | 00:00:01 |
| 7 | LOAD AS SELECT | SYS_TEMP_0FD9D663F_B39FC2B6 | | | | |
| 8 | MERGE JOIN | | 1000000 | 54000000 | 62 | 00:00:01 |
| 9 | SORT JOIN | | 20000 | 540000 | 3 | 00:00:01 |
| 10 | VIEW | | 20000 | 540000 | 2 | 00:00:01 |
| 11 | TABLE ACCESS FULL | SYS_TEMP_0FD9D663E_B39FC2B6 | 20000 | 540000 | 2 | 00:00:01 |
| * 12 | FILTER | | | | | |
| * 13 | SORT JOIN | | 20000 | 540000 | 3 | 00:00:01 |
| 14 | VIEW | | 20000 | 540000 | 2 | 00:00:01 |
| 15 | TABLE ACCESS FULL | SYS_TEMP_0FD9D663E_B39FC2B6 | 20000 | 540000 | 2 | 00:00:01 |
| 16 | UNION-ALL | | | | | |
| * 17 | HASH JOIN | | 15234975 | 1645377300 | 1248 | 00:00:19 |
| 18 | TABLE ACCESS FULL | TABLE2 | 10000 | 270000 | 5 | 00:00:01 |
| * 19 | HASH JOIN | | 3903201 | 316159281 | 1200 | 00:00:18 |
| 20 | TABLE ACCESS FULL | TABLE1 | 10000 | 270000 | 5 | 00:00:01 |
| 21 | VIEW | | 1000000 | 54000000 | 1183 | 00:00:18 |
| 22 | TABLE ACCESS FULL | SYS_TEMP_0FD9D663F_B39FC2B6 | 1000000 | 54000000 | 1183 | 00:00:18 |
| * 23 | HASH JOIN | | 15234975 | 1645377300 | 1248 | 00:00:19 |
| 24 | TABLE ACCESS FULL | TABLE2 | 10000 | 270000 | 5 | 00:00:01 |
| * 25 | HASH JOIN | | 3903201 | 316159281 | 1200 | 00:00:18 |
| 26 | TABLE ACCESS FULL | TABLE1 | 10000 | 270000 | 5 | 00:00:01 |
| 27 | VIEW | | 1000000 | 54000000 | 1183 | 00:00:18 |
| 28 | TABLE ACCESS FULL | SYS_TEMP_0FD9D663F_B39FC2B6 | 1000000 | 54000000 | 1183 | 00:00:18 |
--------------------------------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
------------------------------------------
* 12 - filter("UTL_MATCH"."EDIT_DISTANCE_SIMILARITY"("N1"."NAME","N2"."NAME")>35)
* 13 - access("N1"."NAME"<="N2"."NAME")
* 13 - filter("N1"."NAME"<="N2"."NAME")
* 17 - access("T2"."NAME"="C"."NAME2")
* 19 - access("T1"."NAME"="C"."NAME1")
* 23 - access("T2"."NAME"="C"."NAME1")
* 25 - access("T1"."NAME"="C"."NAME2")
Note
-----
- dynamic sampling used for this statement
*/