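The minimum edit distance algorithm has many uses in search engine development, for example in finding related words.
It can also be used to check whether two pages are identical and, if they are not, how many edit steps are needed to make them identical. Removing the parts the two pages share and keeping the parts that differ makes it possible to extract the main text automatically. Text extraction is then no longer tied to a specific page structure, which gets around the limitations of template-based extraction and makes vertical search more general-purpose.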
package com.lietu.relatedwords;

public class Distance {

    // Returns the smallest of three values.
    private static int Minimum(int a, int b, int c) {
        int mi = a;
        if (b < mi) {
            mi = b;
        }
        if (c < mi) {
            mi = c;
        }
        return mi;
    }

    /**
     * Computes the edit distance between two strings.
     *
     * @param s the source string
     * @param t the target string
     * @return the edit distance between the source and target strings
     */
    public static int LD(String s, String t) {
        int[][] d; // matrix
        int n; // length of s
        int m; // length of t
        int i; // iterates through s
        int j; // iterates through t
        char s_i; // ith character of s
        char t_j; // jth character of t
        int cost; // substitution cost

        // Step 1: initialization. If either string is empty, the distance
        // is the length of the other string.
        n = s.length();
        m = t.length();
        if (n == 0) {
            return m;
        }
        if (m == 0) {
            return n;
        }
        d = new int[n + 1][m + 1];

        // Step 2: initialize the first column to 0..n.
        for (i = 0; i <= n; i++) {
            d[i][0] = i;
        }
        // Initialize the first row to 0..m.
        for (j = 0; j <= m; j++) {
            d[0][j] = j;
        }

        // Step 3: examine each character of s (i from 1 to n).
        for (i = 1; i <= n; i++) {
            s_i = s.charAt(i - 1);
            // Step 4: examine each character of t (j from 1 to m).
            for (j = 1; j <= m; j++) {
                t_j = t.charAt(j - 1);
                // Step 5:
                // If s[i] equals t[j], the cost is 0.
                // If s[i] doesn't equal t[j], the cost is 1.
                if (s_i == t_j) {
                    cost = 0;
                } else {
                    cost = 1;
                }
                // Step 6: set cell d[i][j] of the matrix to the minimum of:
                // a. The cell immediately above plus 1: d[i-1][j] + 1.
                // b. The cell immediately to the left plus 1: d[i][j-1] + 1.
                // c. The cell diagonally above and to the left plus the cost: d[i-1][j-1] + cost.
                d[i][j] = Minimum(d[i - 1][j] + 1, d[i][j - 1] + 1,
                        d[i - 1][j - 1] + cost);
            }
        }

        // Step 7: after the iteration steps (3, 4, 5, 6) are complete,
        // the distance is found in cell d[n][m].
        return d[n][m];
    }
}
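As an illustration of the related-words use case, the following is a minimal sketch, not the author's implementation. The class name `RelatedWordsDemo`, the `rank` helper, and the sample word list are all hypothetical. It assumes that "related" simply means "small edit distance to the query", and it computes the distance with a two-row variant of the same recurrence so that only two rows of the matrix are held in memory.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Hypothetical demo class; not part of the original com.lietu.relatedwords package.
class RelatedWordsDemo {

    // Edit distance using two rows instead of the full (n+1) x (m+1) matrix.
    static int distance(String s, String t) {
        int[] prev = new int[t.length() + 1];
        int[] curr = new int[t.length() + 1];
        for (int j = 0; j <= t.length(); j++) {
            prev[j] = j; // building the first j characters of t from "" takes j insertions
        }
        for (int i = 1; i <= s.length(); i++) {
            curr[0] = i; // reducing the first i characters of s to "" takes i deletions
            for (int j = 1; j <= t.length(); j++) {
                int cost = (s.charAt(i - 1) == t.charAt(j - 1)) ? 0 : 1;
                curr[j] = Math.min(Math.min(prev[j] + 1, curr[j - 1] + 1),
                        prev[j - 1] + cost);
            }
            // The row just filled becomes the "previous" row for the next i.
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[t.length()];
    }

    // Assumption: candidates closer to the query (smaller distance) are "more related".
    // Returns a new list sorted by ascending distance; ties keep their input order.
    static List<String> rank(String query, List<String> candidates) {
        List<String> sorted = new ArrayList<>(candidates);
        sorted.sort(Comparator.comparingInt(c -> distance(query, c)));
        return sorted;
    }

    public static void main(String[] args) {
        // Sample words are made up for illustration.
        List<String> words = Arrays.asList("engine", "searching", "research", "seat");
        System.out.println(rank("search", words));
    }
}
```

A real related-words feature would likely combine this with other signals such as query logs; edit distance alone only captures spelling similarity.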

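In summary, this article introduces the minimum edit distance algorithm, which is widely used in search engine development, for example to implement related-word features and to judge how similar the content of two web pages is. Computing the edit distance between two strings gives the minimum number of operations needed to transform one string into the other.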
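The page-comparison idea (drop what two pages share, keep what differs) can be sketched as follows. This is a simplified illustration under stated assumptions, not the author's extraction method: it treats each page as a list of lines, and instead of character-level edit distance it uses the closely related longest-common-subsequence (LCS) alignment over lines, as a classic line diff does. The class name `TemplateDiffSketch`, the `uniqueLines` helper, and the sample pages are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch class; names and sample data are illustrative only.
class TemplateDiffSketch {

    // Returns the lines of `page` that are not matched by a longest common
    // subsequence of lines shared with `other` (the presumed template).
    static List<String> uniqueLines(List<String> page, List<String> other) {
        int n = page.size();
        int m = other.size();
        // lcs[i][j] = LCS length of page[i..n) and other[j..m)
        int[][] lcs = new int[n + 1][m + 1];
        for (int i = n - 1; i >= 0; i--) {
            for (int j = m - 1; j >= 0; j--) {
                if (page.get(i).equals(other.get(j))) {
                    lcs[i][j] = lcs[i + 1][j + 1] + 1;
                } else {
                    lcs[i][j] = Math.max(lcs[i + 1][j], lcs[i][j + 1]);
                }
            }
        }
        // Walk both pages from the start, skipping shared lines and
        // collecting the lines that appear only in `page`.
        List<String> unique = new ArrayList<>();
        int i = 0;
        int j = 0;
        while (i < n && j < m) {
            if (page.get(i).equals(other.get(j))) {
                i++; // shared (template) line: drop it
                j++;
            } else if (lcs[i + 1][j] >= lcs[i][j + 1]) {
                unique.add(page.get(i)); // line exists only in `page`
                i++;
            } else {
                j++; // line exists only in `other`
            }
        }
        while (i < n) {
            unique.add(page.get(i++));
        }
        return unique;
    }

    public static void main(String[] args) {
        // Two made-up pages that share a navigation bar and a footer.
        List<String> pageA = Arrays.asList("<nav>Home</nav>", "Article one body", "<footer>Copyright</footer>");
        List<String> pageB = Arrays.asList("<nav>Home</nav>", "Article two body", "<footer>Copyright</footer>");
        System.out.println(uniqueLines(pageA, pageB));
    }
}
```

The sketch assumes both pages come from the same site template and that the template lines match exactly; real pages would usually need normalization (whitespace, ids, timestamps) before comparison.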