[Repost] Dynamic Programming Algorithm (DPA) for Edit-Distance

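This article introduces a method for measuring the similarity of two strings: the edit distance algorithm. By defining three basic operations on characters (substitution, insertion, and deletion), the algorithm computes the minimum number of operations needed to transform one string into another. The article explains the recurrence relations and the dynamic programming implementation in detail.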

Reposted from: http://www.csse.monash.edu.au/~lloyd/tildeAlgDS/Dynamic/Edit/

The words `computer' and `commuter' are very similar, and a change of just one letter, p->m will change the first word into the second. The word `sport' can be changed into `sort' by the deletion of the `p', or equivalently, `sort' can be changed into `sport' by the insertion of `p'.

The edit distance of two strings, s1 and s2, is defined as the minimum number of point mutations required to change s1 into s2, where a point mutation is one of:

change a letter,
insert a letter or
delete a letter

The following recurrence relations define the edit distance, d(s1,s2), of two strings s1 and s2:

d('', '') = 0               -- '' = empty string
d(s, '')  = d('', s) = |s|  -- i.e. length of s
d(s1+ch1, s2+ch2)
  = min( d(s1, s2) + if ch1=ch2 then 0 else 1 fi,
         d(s1+ch1, s2) + 1,
         d(s1, s2+ch2) + 1 )

The first two rules above are obviously true, so it is only necessary to consider the last one. Here, neither string is the empty string, so each has a last character, ch1 and ch2 respectively. Somehow, ch1 and ch2 have to be accounted for in an edit of s1+ch1 into s2+ch2. If ch1 equals ch2, they can be matched at no cost, i.e. 0, and the overall edit distance is d(s1,s2). If ch1 differs from ch2, then ch1 can be changed into ch2 at a cost of 1, giving an overall cost of d(s1,s2)+1. Another possibility is to delete ch1 and edit s1 into s2+ch2, giving d(s1,s2+ch2)+1. The last possibility is to edit s1+ch1 into s2 and then insert ch2, giving d(s1+ch1,s2)+1. There are no other alternatives. We take the least expensive, i.e. the min, of these alternatives.

The recurrence relations imply an obvious ternary-recursive routine. This is not a good idea because it is exponentially slow, and impractical for strings of more than a very few characters.
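For illustration only, the ternary-recursive routine might look like the following Python sketch. The function name `edit_distance_naive` is a hypothetical choice, not part of the original page, and the code is meant to show the shape of the recursion rather than to be used on real inputs.

```python
def edit_distance_naive(s1: str, s2: str) -> int:
    """Edit distance by direct ternary recursion (exponential time)."""
    # Base cases: d(s, '') = d('', s) = |s|
    if not s1:
        return len(s2)
    if not s2:
        return len(s1)
    # Split off the last characters ch1 and ch2, as in the recurrence.
    head1, ch1 = s1[:-1], s1[-1]
    head2, ch2 = s2[:-1], s2[-1]
    return min(
        edit_distance_naive(head1, head2) + (0 if ch1 == ch2 else 1),  # match or change
        edit_distance_naive(s1, head2) + 1,  # edit s1+ch1 into s2, then insert ch2
        edit_distance_naive(head1, s2) + 1,  # delete ch1, edit s1 into s2+ch2
    )
```

Each call spawns up to three further calls, and the same pairs of prefixes are recomputed many times, which is where the exponential running time comes from.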

Examination of the relations reveals that d(s1,s2) depends only on d(s1',s2') where s1' is shorter than s1, or s2' is shorter than s2, or both. This allows the dynamic programming technique to be used.
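One way to exploit this observation, sketched here under the assumption that Python's standard `functools.lru_cache` is an acceptable cache, is to keep the recursion but store each d value the first time it is computed. The names `edit_distance_memo` and the inner helper `d` are hypothetical; prefixes are represented by their lengths i and j.

```python
from functools import lru_cache


def edit_distance_memo(s1: str, s2: str) -> int:
    """Edit distance by recursion with cached results for each prefix pair."""

    @lru_cache(maxsize=None)
    def d(i: int, j: int) -> int:
        # d(i, j) is the edit distance between s1[:i] and s2[:j].
        if i == 0:
            return j
        if j == 0:
            return i
        return min(
            d(i - 1, j - 1) + (0 if s1[i - 1] == s2[j - 1] else 1),  # match or change
            d(i, j - 1) + 1,  # insert s2[j-1]
            d(i - 1, j) + 1,  # delete s1[i-1]
        )

    return d(len(s1), len(s2))
```

There are only (|s1|+1)*(|s2|+1) distinct (i, j) pairs, so each is computed once. For long strings this sketch can hit Python's recursion limit, which is one reason to prefer an iterative table.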

A two-dimensional matrix, m[0..|s1|,0..|s2|] is used to hold the edit distance values:

m[i,j] = d(s1[1..i], s2[1..j])

m[0,0] = 0
m[i,0] = i,  i=1..|s1|
m[0,j] = j,  j=1..|s2|

m[i,j] = min(m[i-1,j-1]
             + if s1[i]=s2[j] then 0 else 1 fi,
             m[i-1, j] + 1,
             m[i, j-1] + 1 ),  i=1..|s1|, j=1..|s2|
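The matrix definition above translates fairly directly into code. The following Python sketch fills the whole matrix; the function name `edit_distance_matrix` is hypothetical, and because Python strings are 0-indexed, s1[i] in the formulas becomes `s1[i - 1]` in the code.

```python
def edit_distance_matrix(s1: str, s2: str) -> list[list[int]]:
    """Return the full matrix m, where m[i][j] = d(s1[:i], s2[:j])."""
    rows, cols = len(s1) + 1, len(s2) + 1
    m = [[0] * cols for _ in range(rows)]
    # Boundary conditions: m[i,0] = i and m[0,j] = j.
    for i in range(1, rows):
        m[i][0] = i
    for j in range(1, cols):
        m[0][j] = j
    # General case, filled row by row.
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if s1[i - 1] == s2[j - 1] else 1
            m[i][j] = min(
                m[i - 1][j - 1] + cost,  # match or change
                m[i - 1][j] + 1,         # delete s1[i-1]
                m[i][j - 1] + 1,         # insert s2[j-1]
            )
    return m
```

The edit distance itself is the bottom-right entry, for example `edit_distance_matrix("sport", "sort")[-1][-1]`.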

m[,] can be computed row by row. Row m[i,] depends only on row m[i-1,] and on entries to its left in the same row. The time complexity of this algorithm is O(|s1|*|s2|). If s1 and s2 have a `similar' length, about `n' say, this complexity is O(n²), much better than exponential!
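Because each row needs only the row before it, the full matrix does not have to be stored when only the distance is wanted. A minimal Python sketch of this idea, with the hypothetical name `edit_distance_two_rows`, keeps just the previous and the current row:

```python
def edit_distance_two_rows(s1: str, s2: str) -> int:
    """Edit distance using O(|s2|) extra space instead of the full matrix."""
    # prev holds row m[i-1,]; it starts as row 0, i.e. m[0,j] = j.
    prev = list(range(len(s2) + 1))
    for i, ch1 in enumerate(s1, start=1):
        curr = [i]  # m[i,0] = i
        for j, ch2 in enumerate(s2, start=1):
            cost = 0 if ch1 == ch2 else 1
            curr.append(min(
                prev[j - 1] + cost,  # match or change
                prev[j] + 1,         # delete ch1
                curr[j - 1] + 1,     # insert ch2
            ))
        prev = curr
    return prev[-1]
```

The running time is still O(|s1|*|s2|); only the memory use drops. The trade-off is that the sequence of edits can no longer be traced back through the matrix.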
