(Repost) String Matching Algorithms: Edit Distance

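This article introduces the concept of edit distance in string matching and walks through the Levenshtein distance algorithm. The algorithm quantifies how similar two strings are by computing the minimum number of insertions, deletions, and substitutions needed to transform one string into the other.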


How can we compare the similarity (or difference) between two strings?

 

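To compare how similar two strings are, we can count how many operations it takes to transform one string into the other. The fewer the steps, the better the two strings match. This measure of transformation steps is called the edit distance: the smaller the value, the more closely the two strings match.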

 

However, edit distance has several different definitions.

 

http://en.wikipedia.org/wiki/Edit_distance states:
Edit distance between two strings of characters generally refers to the Levenshtein distance.

However, the term "edit distance" is sometimes used to refer to the distance in which insertions and deletions have equal cost and replacements have twice the cost of an insertion.

There are several different ways to define an edit distance, depending on which edit operations are allowed: replace, delete, insert, transpose, and so on.
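To make the difference between these definitions concrete, here is a minimal Python sketch in which the cost of each operation is a parameter. The function name `weighted_edit_distance` and the parameters `ins_cost`, `del_cost`, and `sub_cost` are hypothetical names chosen for illustration; they do not come from the quoted sources. Transposition is not modeled.

```python
def weighted_edit_distance(s, t, ins_cost=1, del_cost=1, sub_cost=1):
    """Edit distance with configurable per-operation costs (illustrative sketch)."""
    # prev[j] = cost of turning the first i-1 characters of s into the first j of t
    prev = [j * ins_cost for j in range(len(t) + 1)]
    for i in range(1, len(s) + 1):
        curr = [i * del_cost]  # deleting i characters of s yields the empty string
        for j in range(1, len(t) + 1):
            replace = prev[j - 1] + (0 if s[i - 1] == t[j - 1] else sub_cost)
            curr.append(min(prev[j] + del_cost, curr[j - 1] + ins_cost, replace))
        prev = curr
    return prev[-1]


# Unit costs give the Levenshtein distance.
print(weighted_edit_distance("kitten", "sitting"))              # 3
# Replacements cost twice as much as an insertion or a deletion.
print(weighted_edit_distance("kitten", "sitting", sub_cost=2))  # 5
```

With `sub_cost=2`, a replacement costs the same as a deletion followed by an insertion, so the same pair of strings gets a larger distance than under unit costs.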
  

 

Correspondingly, there are different algorithms for computing this value. A common one is the Levenshtein distance.

 

Levenshtein distance

 

http://en.wikipedia.org/wiki/Levenshtein_distance states:
The Levenshtein distance between two strings is defined as the minimum number of edits needed to transform one string into the other, with the allowable edit operations being insertion, deletion, or substitution of a single character.
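This definition can be read almost directly as a recursion over string prefixes. The sketch below is one possible reading, not the method from the quoted source: the name `levenshtein_recursive` and the helper `dist` are hypothetical, and memoization with `functools.lru_cache` is added only so that repeated subproblems are not recomputed. It is meant for short strings, since deep recursion would hit Python's recursion limit.

```python
from functools import lru_cache


def levenshtein_recursive(s, t):
    """Levenshtein distance computed top-down over prefixes (illustrative sketch)."""

    @lru_cache(maxsize=None)
    def dist(i, j):
        # Distance between the first i characters of s and the first j of t.
        if i == 0:
            return j  # insert the j remaining characters of t
        if j == 0:
            return i  # delete the i remaining characters of s
        cost = 0 if s[i - 1] == t[j - 1] else 1
        return min(
            dist(i - 1, j) + 1,         # deletion
            dist(i, j - 1) + 1,         # insertion
            dist(i - 1, j - 1) + cost,  # substitution (or match when cost is 0)
        )

    return dist(len(s), len(t))


print(levenshtein_recursive("flaw", "lawn"))  # 2
```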
 

Basic steps of the algorithm

 

 

Step  Description
1     Set n to be the length of s.
      Set m to be the length of t.
      If n = 0, return m and exit.
      If m = 0, return n and exit.
      Construct a matrix d containing 0..n rows and 0..m columns.
2     Initialize the first column to 0..n.
      Initialize the first row to 0..m.
3     Examine each character of s (i from 1 to n).
4     Examine each character of t (j from 1 to m).
5     If s[i] equals t[j], the cost is 0.
      If s[i] doesn't equal t[j], the cost is 1.
6     Set cell d[i,j] of the matrix equal to the minimum of:
      a. The cell immediately above plus 1: d[i-1,j] + 1.
      b. The cell immediately to the left plus 1: d[i,j-1] + 1.
      c. The cell diagonally above and to the left plus the cost: d[i-1,j-1] + cost.
7     After the iteration steps (3, 4, 5, 6) are complete, the distance is found in cell d[n,m].
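The steps above can be sketched in Python as follows. This is a minimal illustration rather than a reference implementation: the function name `levenshtein` is a hypothetical choice, the first index of `d` runs over `s` and the second over `t` so that the result sits in `d[n][m]`, and because Python strings are 0-indexed, `s[i - 1]` stands in for the table's `s[i]`.

```python
def levenshtein(s, t):
    """Levenshtein distance via the full matrix, following steps 1 to 7."""
    # Step 1: lengths and trivial cases.
    n, m = len(s), len(t)
    if n == 0:
        return m
    if m == 0:
        return n
    d = [[0] * (m + 1) for _ in range(n + 1)]

    # Step 2: distances from and to the empty prefix.
    for i in range(n + 1):
        d[i][0] = i
    for j in range(m + 1):
        d[0][j] = j

    # Steps 3 and 4: examine every pair of characters.
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Step 5: cost is 0 on a match, 1 otherwise.
            cost = 0 if s[i - 1] == t[j - 1] else 1
            # Step 6: minimum of deletion, insertion, and substitution.
            d[i][j] = min(
                d[i - 1][j] + 1,
                d[i][j - 1] + 1,
                d[i - 1][j - 1] + cost,
            )

    # Step 7: the distance is in the last cell.
    return d[n][m]


print(levenshtein("kitten", "sitting"))  # 3
```

The sketch uses O(n*m) time and memory. Since each row depends only on the previous row, keeping just two rows would be enough when only the final distance is needed.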

 

 

Example: compare the similarity of "GUMBO" and "GAMBOL" by computing their edit distance.
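The distance matrix for this example can be reproduced with a short Python sketch. The names `levenshtein_matrix` and `fmt` are hypothetical helpers introduced here for illustration. Rows correspond to the characters of "GUMBO" and columns to the characters of "GAMBOL".

```python
def levenshtein_matrix(s, t):
    """Return the full (len(s)+1) x (len(t)+1) Levenshtein distance matrix."""
    n, m = len(s), len(t)
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i
    for j in range(m + 1):
        d[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if s[i - 1] == t[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost)
    return d


def fmt(label, cells):
    # One printed line: a row label followed by right-aligned cells.
    return label + " " + " ".join(f"{c:>2}" for c in cells)


s, t = "GUMBO", "GAMBOL"
matrix = levenshtein_matrix(s, t)

print(fmt(" ", " " + t))                 # header: characters of t
for label, row in zip(" " + s, matrix):  # first row belongs to the empty prefix
    print(fmt(label, row))
print("distance:", matrix[len(s)][len(t)])  # distance: 2
```

The distance is 2: one substitution (U to A) and one insertion (L at the end) turn "GUMBO" into "GAMBOL".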


 

References

1. Levenshtein distance

http://www.merriampark.com/ld.htm

http://en.wikipedia.org/wiki/Levenshtein_distance

2. Edit distance

http://en.wikipedia.org/wiki/Edit_distance
