You can compare string using equality operator (=), or using similarity operator (LIKE). But there are cases where it is not enough. You Can use UTL_MATCH package to calculate string similarity index. This packages offers 4 functions that take two strings as parameters:
- edit_distance - algorithm by Levenshtein - returns number of edits that must be done to change one string into second,
- edit_distance_similarity - normalized results of edit_distance in percents - integer values,
- jaro_winkler - returns similarity based on Jaro-Winkler distance algorithm,
- jaro_winkler_similarity - same as above but presented as integer in range 0-100.
create table countries(name varchar2(15) not null);
insert into countries values ('Poland');
insert into countries values ('Germany');
insert into countries values ('United States');
insert into countries values ('Portugal');
insert into countries values ('Czech Republic');
insert into countries values ('China');
insert into countries values ('Slovakia');
insert into countries values ('Slovenia');
commit;
Now we can take a look at results. Lets compare to miss-spelled country name: '
Slovnia'
select name ,to_char(utl_match.edit_distance(name, 'Slovnia'),'999') edit_dist ,to_char(utl_match.edit_distance_similarity(name, 'Slovnia'),'999') edit_dist_sim ,to_char(utl_match.jaro_winkler(name, 'Slovnia'),'999d9999') jaro_winkler ,to_char(utl_match.jaro_winkler_similarity(name, 'Slovnia'),'999') jaro_winkler_sim from countries order by jaro_winkler_sim desc; NAME EDIT EDIT JARO_WINK JARO -------------------- ---- ---- --------- ---- Slovenia 1 88 .9750 97 Slovakia 2 75 .8881 88 China 5 29 .5619 56 United States 12 8 .5531 55 Poland 6 15 .5317 53 Portugal 7 13 .5119 51 Germany 7 0 .3571 35 Czech Republic 13 8 .0000 0Above we can observe differences in algorithms.
Lets test it on NULLs:
SQL> select to_char(utl_match.edit_distance('test', NULL),'999')
2 edit_dist
3 ,to_char(utl_match.edit_distance_similarity('test', NULL),'999')
4 edit_dist_sim
5 ,to_char(utl_match.jaro_winkler('test', NULL),'999d9999')
6 jaro_winkler
7 ,to_char(utl_match.jaro_winkler_similarity('test', NULL),'999')
8 jaro_winkler_sim
9 from dual;
EDIT EDIT JARO_WINK JARO
---- ---- --------- ----
-1 125 .0000 0We can see that using
edit_distance on NULLs migth be dengerous.
All samples were run on Oracle 10g XE
本文介绍如何使用Oracle的UTL_MATCH包来计算字符串之间的相似度指数。通过四个函数:编辑距离、编辑距离相似度、Jaro-Winkler及Jaro-Winkler相似度,展示它们在实际场景中的应用,并比较不同算法的效果。
5402

被折叠的 条评论
为什么被折叠?



