Levenshtein Distance, in Three Flavors

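This article introduces the basic concepts of the Levenshtein distance algorithm and its applications in areas such as spell checking and speech recognition, and provides implementations in Java, C++, and Visual Basic.

Reposted from: http://www.merriampark.com/ld.htm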

by Michael Gilleland

The purpose of this short essay is to describe the Levenshtein distance algorithm and show how it can be implemented in three different programming languages.

What is Levenshtein Distance?
Demonstration
The Algorithm
Source Code, in Three Flavors
References
Other Flavors


What is Levenshtein Distance?

Levenshtein distance (LD) is a measure of the similarity between two strings, which we will refer to as the source string (s) and the target string (t). The distance is the minimum number of deletions, insertions, or substitutions required to transform s into t. For example,

  • If s is "test" and t is "test", then LD(s,t) = 0, because no transformations are needed. The strings are already identical.
  • If s is "test" and t is "tent", then LD(s,t) = 1, because one substitution (change "s" to "n") is sufficient to transform s into t.

The greater the Levenshtein distance, the more different the strings are.
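
The definition can also be read recursively: compare the first characters, and if they differ, try a deletion, an insertion, and a substitution and keep the cheapest. The sketch below is only an illustration of that reading, not the algorithm described later in this essay; the class name NaiveDistance is made up for the example, and the running time is exponential, so it is suitable for short strings only.

```java
// Hypothetical class name; a direct, exponential-time reading of the definition.
class NaiveDistance {

    static int distance(String s, String t) {
        // Transforming an empty string means inserting every character of t.
        if (s.isEmpty()) {
            return t.length();
        }
        // Transforming into an empty string means deleting every character of s.
        if (t.isEmpty()) {
            return s.length();
        }

        String sRest = s.substring(1);
        String tRest = t.substring(1);

        // Matching first characters cost nothing.
        if (s.charAt(0) == t.charAt(0)) {
            return distance(sRest, tRest);
        }

        int delete = distance(sRest, t);         // drop the first character of s
        int insert = distance(s, tRest);         // insert the first character of t
        int substitute = distance(sRest, tRest); // replace one character with the other
        return 1 + Math.min(delete, Math.min(insert, substitute));
    }

    public static void main(String[] args) {
        System.out.println(distance("test", "test")); // 0
        System.out.println(distance("test", "tent")); // 1
    }
}
```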

 

Levenshtein distance is named after the Russian scientist Vladimir Levenshtein, who devised the algorithm in 1965. If you can't spell or pronounce Levenshtein, the metric is also sometimes called edit distance.

The Levenshtein distance algorithm has been used in:

  • Spell checking
  • Speech recognition
  • DNA analysis
  • Plagiarism detection
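
As a rough illustration of the spell-checking use, a checker might suggest the dictionary word with the smallest distance to a misspelled input. The sketch below assumes a tiny in-memory word list; the class name SpellSuggest, the suggest method, and the sample words are all invented for the example, and real spell checkers use much larger dictionaries and faster search strategies.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper class; the name and the word list are illustrative only.
class SpellSuggest {

    // Levenshtein distance computed with a full (n+1) x (m+1) matrix.
    static int distance(String s, String t) {
        int n = s.length();
        int m = t.length();
        int[][] d = new int[n + 1][m + 1];
        for (int i = 0; i <= n; i++) {
            d[i][0] = i;
        }
        for (int j = 0; j <= m; j++) {
            d[0][j] = j;
        }
        for (int i = 1; i <= n; i++) {
            for (int j = 1; j <= m; j++) {
                int cost = (s.charAt(i - 1) == t.charAt(j - 1)) ? 0 : 1;
                d[i][j] = Math.min(Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1),
                                   d[i - 1][j - 1] + cost);
            }
        }
        return d[n][m];
    }

    // Return the dictionary word closest to the input.
    // Ties go to the word that appears first in the list.
    static String suggest(String word, List<String> dictionary) {
        String best = null;
        int bestDistance = Integer.MAX_VALUE;
        for (String candidate : dictionary) {
            int dist = distance(word, candidate);
            if (dist < bestDistance) {
                bestDistance = dist;
                best = candidate;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        List<String> dictionary = Arrays.asList("test", "tent", "text", "toast");
        System.out.println(suggest("toost", dictionary)); // toast
    }
}
```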

 


Demonstration

The original page embedded a simple Java applet for experimenting with different strings and computing their Levenshtein distance; the applet is not reproduced in this repost.

 


The Algorithm

Steps

Step  Description
1     Set n to be the length of s.
      Set m to be the length of t.
      If n = 0, return m and exit.
      If m = 0, return n and exit.
      Construct a matrix containing 0..m rows and 0..n columns.
2     Initialize the first row to 0..n.
      Initialize the first column to 0..m.
3     Examine each character of s (i from 1 to n).
4     Examine each character of t (j from 1 to m).
5     If s[i] equals t[j], the cost is 0.
      If s[i] doesn't equal t[j], the cost is 1.
6     Set cell d[i,j] of the matrix (column i, row j) equal to the minimum of:
      a. The cell immediately to the left plus 1: d[i-1,j] + 1.
      b. The cell immediately above plus 1: d[i,j-1] + 1.
      c. The cell diagonally above and to the left plus the cost: d[i-1,j-1] + cost.
7     After the iteration steps (3, 4, 5, 6) are complete, the distance is found in cell d[n,m].

Example

This section shows how the Levenshtein distance is computed when the source string is "GUMBO" and the target string is "GAMBOL".

Steps 1 and 2

          G  U  M  B  O
       0  1  2  3  4  5
    G  1
    A  2
    M  3
    B  4
    O  5
    L  6

Steps 3 to 6, when i = 1

          G  U  M  B  O
       0  1  2  3  4  5
    G  1  0
    A  2  1
    M  3  2
    B  4  3
    O  5  4
    L  6  5

Steps 3 to 6, when i = 2

          G  U  M  B  O
       0  1  2  3  4  5
    G  1  0  1
    A  2  1  1
    M  3  2  2
    B  4  3  3
    O  5  4  4
    L  6  5  5

Steps 3 to 6, when i = 3

          G  U  M  B  O
       0  1  2  3  4  5
    G  1  0  1  2
    A  2  1  1  2
    M  3  2  2  1
    B  4  3  3  2
    O  5  4  4  3
    L  6  5  5  4

Steps 3 to 6, when i = 4

          G  U  M  B  O
       0  1  2  3  4  5
    G  1  0  1  2  3
    A  2  1  1  2  3
    M  3  2  2  1  2
    B  4  3  3  2  1
    O  5  4  4  3  2
    L  6  5  5  4  3

Steps 3 to 6, when i = 5

          G  U  M  B  O
       0  1  2  3  4  5
    G  1  0  1  2  3  4
    A  2  1  1  2  3  4
    M  3  2  2  1  2  3
    B  4  3  3  2  1  2
    O  5  4  4  3  2  1
    L  6  5  5  4  3  2

Step 7

The distance is in the lower right-hand corner of the matrix, i.e. 2. This corresponds to our intuitive realization that "GUMBO" can be transformed into "GAMBOL" by substituting "A" for "U" and adding "L" (one substitution and one insertion = 2 changes).
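
The worked example can be reproduced with a short program that builds the whole matrix and prints it. The sketch below lays the matrix out as in the example (one column per character of s, one row per character of t); the class name MatrixDemo and its matrix method are invented for illustration, and the print formatting is just one possible choice.

```java
// Hypothetical class name; prints the matrix for the GUMBO / GAMBOL example.
class MatrixDemo {

    // Build the full matrix with one row per character of t and one column
    // per character of s, matching the layout used in the example.
    static int[][] matrix(String s, String t) {
        int n = s.length();
        int m = t.length();
        int[][] d = new int[m + 1][n + 1];
        for (int col = 0; col <= n; col++) {
            d[0][col] = col;
        }
        for (int row = 0; row <= m; row++) {
            d[row][0] = row;
        }
        // Fill column by column, as the example does for i = 1..n.
        for (int col = 1; col <= n; col++) {
            for (int row = 1; row <= m; row++) {
                int cost = (s.charAt(col - 1) == t.charAt(row - 1)) ? 0 : 1;
                int above = d[row - 1][col] + 1;
                int left = d[row][col - 1] + 1;
                int diag = d[row - 1][col - 1] + cost;
                d[row][col] = Math.min(Math.min(above, left), diag);
            }
        }
        return d;
    }

    public static void main(String[] args) {
        String s = "GUMBO";
        String t = "GAMBOL";
        int[][] d = matrix(s, t);

        // Header line: blank label column, blank column for index 0, then s.
        StringBuilder header = new StringBuilder("      ");
        for (char c : s.toCharArray()) {
            header.append(String.format("%3c", c));
        }
        System.out.println(header);

        for (int row = 0; row < d.length; row++) {
            String label = (row == 0) ? " " : String.valueOf(t.charAt(row - 1));
            StringBuilder line = new StringBuilder(String.format("%3s", label));
            for (int value : d[row]) {
                line.append(String.format("%3d", value));
            }
            System.out.println(line);
        }

        // The distance is the bottom-right cell.
        System.out.println("Distance: " + d[t.length()][s.length()]);
    }
}
```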


Source Code, in Three Flavors

Religious wars often flare up whenever engineers discuss differences between programming languages. A typical assertion is Allen Holub's claim in a JavaWorld article (July 1999): "Visual Basic, for example, isn't in the least bit object-oriented. Neither is Microsoft Foundation Classes (MFC) or most of the other Microsoft technology that claims to be object-oriented."

A salvo from a different direction is Simson Garfinkel's article in Salon (Jan. 8, 2001) entitled "Java: Slow, ugly and irrelevant", which opens with the unequivocal words "I hate Java".

We prefer to take a neutral stance in these religious wars. As a practical matter, if a problem can be solved in one programming language, you can usually solve it in another as well. A good programmer is able to move from one language to another with relative ease, and learning a completely new language should not present any major difficulties, either. A programming language is a means to an end, not an end in itself.

As a modest illustration of this principle of neutrality, we present source code which implements the Levenshtein distance algorithm in the following programming languages:

  • Java
  • C++
  • Visual Basic

These three implementations are hereby placed in the public domain and are free for anyone to use.

 


Java

public class Distance {

  //****************************
  // Get minimum of three values
  //****************************

  private int Minimum (int a, int b, int c) {
  int mi;

    mi = a;
    if (b < mi) {
      mi = b;
    }
    if (c < mi) {
      mi = c;
    }
    return mi;

  }

  //*****************************
  // Compute Levenshtein distance
  //*****************************

  public int LD (String s, String t) {
  int d[][]; // matrix
  int n; // length of s
  int m; // length of t
  int i; // iterates through s
  int j; // iterates through t
  char s_i; // ith character of s
  char t_j; // jth character of t
  int cost; // cost

    // Step 1

    n = s.length ();
    m = t.length ();
    if (n == 0) {
      return m;
    }
    if (m == 0) {
      return n;
    }
    d = new int[n+1][m+1];

    // Step 2

    for (i = 0; i <= n; i++) {
      d[i][0] = i;
    }

    for (j = 0; j <= m; j++) {
      d[0][j] = j;
    }

    // Step 3

    for (i = 1; i <= n; i++) {

      s_i = s.charAt (i - 1);

      // Step 4

      for (j = 1; j <= m; j++) {

        t_j = t.charAt (j - 1);

        // Step 5

        if (s_i == t_j) {
          cost = 0;
        }
        else {
          cost = 1;
        }

        // Step 6

        d[i][j] = Minimum (d[i-1][j]+1, d[i][j-1]+1, d[i-1][j-1] + cost);

      }

    }

    // Step 7

    return d[n][m];

  }

}

C++

In standard C++, the size of an array declared this way must be a compile-time constant, so this code fragment is rejected by a conforming compiler (some compilers accept it only as an extension):

int sz = 5;
int arr[sz];

This limitation makes the following C++ code slightly more complicated than it would be if the matrix could simply be declared as a two-dimensional array, with a size determined at run-time.

In C++ it's more idiomatic to use the Standard Template Library's vector class, as Anders Sewerin Johansen has done in an alternative C++ implementation.

Here is the definition of the class (distance.h):

class Distance
{
  public:
    int LD (char const *s, char const *t);
  private:
    int Minimum (int a, int b, int c);
    int *GetCellPointer (int *pOrigin, int col, int row, int nCols);
    int GetAt (int *pOrigin, int col, int row, int nCols);
    void PutAt (int *pOrigin, int col, int row, int nCols, int x);
}; 

Here is the implementation of the class (distance.cpp):

#include "distance.h"
#include <string.h>
#include <stdlib.h> // malloc and free (the standard header, instead of the non-standard malloc.h)

//****************************
// Get minimum of three values
//****************************

int Distance::Minimum (int a, int b, int c)
{
int mi;

  mi = a;
  if (b < mi) {
    mi = b;
  }
  if (c < mi) {
    mi = c;
  }
  return mi;

}

//**************************************************
// Get a pointer to the specified cell of the matrix
//************************************************** 

int *Distance::GetCellPointer (int *pOrigin, int col, int row, int nCols)
{
  return pOrigin + col + (row * (nCols + 1));
}

//*****************************************************
// Get the contents of the specified cell in the matrix 
//*****************************************************

int Distance::GetAt (int *pOrigin, int col, int row, int nCols)
{
int *pCell;

  pCell = GetCellPointer (pOrigin, col, row, nCols);
  return *pCell;

}

//*******************************************************
// Fill the specified cell in the matrix with the value x
//*******************************************************

void Distance::PutAt (int *pOrigin, int col, int row, int nCols, int x)
{
int *pCell;

  pCell = GetCellPointer (pOrigin, col, row, nCols);
  *pCell = x;

}

//*****************************
// Compute Levenshtein distance
//*****************************

int Distance::LD (char const *s, char const *t)
{
int *d; // pointer to matrix
int n; // length of s
int m; // length of t
int i; // iterates through s
int j; // iterates through t
char s_i; // ith character of s
char t_j; // jth character of t
int cost; // cost
int result; // result
int cell; // contents of target cell
int above; // contents of cell immediately above
int left; // contents of cell immediately to left
int diag; // contents of cell immediately above and to left
int sz; // number of cells in matrix

  // Step 1	

  n = strlen (s);
  m = strlen (t);
  if (n == 0) {
    return m;
  }
  if (m == 0) {
    return n;
  }
  sz = (n+1) * (m+1) * sizeof (int);
  d = (int *) malloc (sz);

  // Step 2

  for (i = 0; i <= n; i++) {
    PutAt (d, i, 0, n, i);
  }

  for (j = 0; j <= m; j++) {
    PutAt (d, 0, j, n, j);
  }

  // Step 3

  for (i = 1; i <= n; i++) {

    s_i = s[i-1];

    // Step 4

    for (j = 1; j <= m; j++) {

      t_j = t[j-1];

      // Step 5

      if (s_i == t_j) {
        cost = 0;
      }
      else {
        cost = 1;
      }

      // Step 6 

      above = GetAt (d,i-1,j, n);
      left = GetAt (d,i, j-1, n);
      diag = GetAt (d, i-1,j-1, n);
      cell = Minimum (above + 1, left + 1, diag + cost);
      PutAt (d, i, j, n, cell);
    }
  }

  // Step 7

  result = GetAt (d, n, m, n);
  free (d);
  return result;
	
}

Visual Basic

'*******************************
'*** Get minimum of three values
'*******************************

Private Function Minimum(ByVal a As Integer, _
                         ByVal b As Integer, _
                         ByVal c As Integer) As Integer
Dim mi As Integer
                          
  mi = a
  If b < mi Then
    mi = b
  End If
  If c < mi Then
    mi = c
  End If
  
  Minimum = mi
                          
End Function

'********************************
'*** Compute Levenshtein Distance
'********************************

Public Function LD(ByVal s As String, ByVal t As String) As Integer
Dim d() As Integer ' matrix
Dim m As Integer ' length of t
Dim n As Integer ' length of s
Dim i As Integer ' iterates through s
Dim j As Integer ' iterates through t
Dim s_i As String ' ith character of s
Dim t_j As String ' jth character of t
Dim cost As Integer ' cost
  
  ' Step 1
  
  n = Len(s)
  m = Len(t)
  If n = 0 Then
    LD = m
    Exit Function
  End If 
  If m = 0 Then
    LD = n
    Exit Function
  End If 
  ReDim d(0 To n, 0 To m) As Integer
  
  ' Step 2
  
  For i = 0 To n
    d(i, 0) = i
  Next i
  
  For j = 0 To m
    d(0, j) = j
  Next j

  ' Step 3

  For i = 1 To n
    
    s_i = Mid$(s, i, 1)
    
    ' Step 4
    
    For j = 1 To m
      
      t_j = Mid$(t, j, 1)
      
      ' Step 5
      
      If s_i = t_j Then
        cost = 0
      Else
        cost = 1
      End If
      
      ' Step 6
      
      d(i, j) = Minimum(d(i - 1, j) + 1, d(i, j - 1) + 1, d(i - 1, j - 1) + cost)
    
    Next j
    
  Next i
  
  ' Step 7
  
  LD = d(n, m)
  Erase d

End Function

References

Other discussions of Levenshtein distance are:


Other Flavors

The following people have kindly consented to make their implementations of the Levenshtein Distance Algorithm in various languages available here:

  • Eli Bendersky has written an implementation in Perl.
  • Barbara Boehmer has written an implementation in Oracle PL/SQL.
  • Rick Bourner has written an implementation in Objective-C.
  • Chas Emerick has written an implementation in Java, which avoids an OutOfMemoryError which can occur when my Java implementation is used with very large strings.
  • Joseph Gama has written an implementation in TSQL, as part of a package of TSQL functions at Planet Source Code.
  • Anders Sewerin Johansen has written an implementation in C++, which is more elegant, better optimized, and more in the spirit of C++ than mine.
  • Lasse Johansen has written an implementation in C#.
  • Adam Lindberg and Fredrik Svensson have written an implementation in Erlang.
  • Alvaro Jeria Madariaga has written an implementation in Delphi.
  • Lorenzo Seidenari has written an implementation in C, and Lars Rustemeier has provided a Scheme wrapper for this C implementation as part of Eggs Unlimited, a library of extensions to the Chicken Scheme system.
  • Steve Southwell has written an implementation in Progress 4gl.
  • Lukasz Stilger has written an implementation in JavaScript which illustrates the algorithm in operation (well worth seeing). Note that "wyraz" is Polish for "word". A separate page with the source code as text is here.
  • Jorge Mas Trullenque points out that "the calculation needs O(n) memory, so using a two-dimensional matrix in a practical implementation is wasteful." He has written an implementation in Perl that uses only one one-dimensional vector.
  • Joerg F. Wittenberger has written an implementation in Rscheme.
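
The memory point raised by Jorge Mas Trullenque can be sketched in Java. Each cell depends only on the current row and the one before it, so two arrays of length m + 1 are enough. This is not his Perl code, which uses a single vector, nor Chas Emerick's Java version; it is a minimal illustration of the same idea, and the class name TwoRowDistance is made up for the example.

```java
// Hypothetical class name; keeps two rows instead of the full matrix.
class TwoRowDistance {

    static int distance(String s, String t) {
        int n = s.length();
        int m = t.length();
        if (n == 0) {
            return m;
        }
        if (m == 0) {
            return n;
        }

        int[] previous = new int[m + 1]; // row i - 1 of the full matrix
        int[] current = new int[m + 1];  // row i of the full matrix

        // Row 0: turning an empty prefix of s into the first j characters of t.
        for (int j = 0; j <= m; j++) {
            previous[j] = j;
        }

        for (int i = 1; i <= n; i++) {
            current[0] = i;
            for (int j = 1; j <= m; j++) {
                int cost = (s.charAt(i - 1) == t.charAt(j - 1)) ? 0 : 1;
                current[j] = Math.min(Math.min(previous[j] + 1, current[j - 1] + 1),
                                      previous[j - 1] + cost);
            }
            // Swap the arrays so the row just computed becomes "previous".
            int[] tmp = previous;
            previous = current;
            current = tmp;
        }

        // After the final swap, the last computed row is in "previous".
        return previous[m];
    }

    public static void main(String[] args) {
        System.out.println(distance("GUMBO", "GAMBOL")); // 2
    }
}
```

Memory use here grows with the length of t only, rather than with the product of both lengths.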

 

Other implementations outside these pages include:

  • An Emacs Lisp implementation by Art Taylor.
  • A Python implementation by Magnus Lie Hetland.
  • A Tcl implementation by Richard Suchenwirth (thanks to Stefan Seidler for pointing this out).
  • A PHP implementation (thanks to Dan Tripp for pointing this out).
  • A Scheme implementation by Neil Van Dyke.