一、BF
BF,简单直观也是字符串处理库中使用最多的,在处理数据长度不大的时候效率不会比KMP和BM慢,示例代码如下:
/**
* Code shared by String and StringBuffer to do searches. The
* source is the character array being searched, and the target
* is the string being searched for.
*
* @param source the characters being searched.
* @param sourceOffset offset of the source string.
* @param sourceCount count of the source string.
* @param target the characters being searched for.
* @param targetOffset offset of the target string.
* @param targetCount count of the target string.
* @param fromIndex the index to begin searching from.
*/
static int indexOf(char[] source, int sourceOffset, int sourceCount,
char[] target, int targetOffset, int targetCount,
int fromIndex) {
if (fromIndex >= sourceCount) {
return (targetCount == 0 ? sourceCount : -1);
}
if (fromIndex < 0) {
fromIndex = 0;
}
if (targetCount == 0) {
return fromIndex;
}
char first = target[targetOffset];
int max = sourceOffset + (sourceCount - targetCount);
for (int i = sourceOffset + fromIndex; i <= max; i++) {
/* Look for first character. */
if (source[i] != first) {
while (++i <= max && source[i] != first);
}
/* Found first character, now look at the rest of v2 */
if (i <= max) {
int j = i + 1;
int end = j + targetCount - 1;
for (int k = targetOffset + 1; j < end && source[j]
== target[k]; j++, k++);
if (j == end) {
/* Found whole string. */
return i - sourceOffset;
}
}
}
return -1;
}
二、KMP
在BF算法中,每次匹配pattern字符串后是简单的把text的下标回溯,pattern的下标设置为0,这种方法没有充分利用我们已经对比的字符中已知的信息,如果能把已知的信息利用上,就会减少很多重复工作。
KMP算法通过对pattern进行预计算来减少下标回溯的距离,进而加快查找的速度,预计算的数据存储在一个数组中,举例说明:
text:ababcdefgh pattern:ababd
ababcdefgh
a
babd
BF算法
aba
bd
KMP算法
可以看到当pattern自身有重复的时候,KMP算法利用了这个信息,不需要重新匹配已经匹配的重复字符,匹配算法讲解如下:
对pattern预计算的结果存储在failure[]中,failure[0] = 0,假设failure[i] = k,即text[0 … k-1] == pattern[i-k, i-1],当pattern[i] == pattern[k],则pattern[0 … k] == pattern[i-k, i],进而failure[i+1] = failure[i] + 1 = k + 1;当pattern[i] != pattern[k],模式匹配失败,则k = failure[k]。KMP算法如下:
public class KMPMatch {
private String string;
private String pattern;
private int[] failure;
private int matchPoint;
public KMPMatch(String string, String pattern) {
this.string = string;
this.pattern = pattern;
failure = new int[pattern.length()];
computeFailure();
}
public int getMatchPoint() {
return matchPoint;
}
public boolean match() {
// Tries to find an occurence of the pattern in the string
int j = 0;
if (string.length() == 0) return false;
for (int i = 0; i < string.length(); i++) {
while (j > 0 && pattern.charAt(j) != string.charAt(i)) {
j = failure[j - 1];
}
if (pattern.charAt(j) == string.charAt(i)) { j++; }
if (j == pattern.length()) {
matchPoint = i - pattern.length() + 1;
return true;
}
}
return false;
}
public boolean match1() {
int i = 0;
int j = 0;
if (string.length() == 0) return false;
while (i + pattern.length() - j <= string.length()) {
if (j >= pattern.length()) {
matchPoint = i - pattern.length();
return true;
}
if (string.charAt(i) == pattern.charAt(j)) {
i++;
j++;
} else {
if (j > 0) { j = failure[j - 1]; }
else { i++; }
}
}
return false;
}
/**
* Computes the failure function using a boot-strapping process,
* where the pattern is matched against itself.
*/
private void computeFailure() {
int j = 0;
for (int i = 1; i < pattern.length(); i++) {
while (j > 0 && pattern.charAt(j) != pattern.charAt(i)) { j = failure[j - 1]; }
if (pattern.charAt(j) == pattern.charAt(i)) { j++; }
failure[i] = j;
}
}
}
根据KMP的pattern的failure生成规则可以看出,kmp比较适合字符集基数较少,pattern中能够出现较多重复规则的情况下使用性能比较好,例如DNA等
三、BM
BM算法有两个要点,
(1)在对比text和pattern的时候,pattern从后往前对比
(2)在从后向前对比pattern和text的时候,c是text中当前对比的字符,如果相等正常向前对比就行;如果不相等,分两种情况处理:
1)如果c不包含在pattern中,这个时候text的对比索引可以向前进pattern的长度
2)如果c包含在pattern里,由于是从后向前对比的,所以找出pattern中最右边出现的c,将text的索引向前使text中的c跟pattern中的c对齐
由于要预先计算出text所属字符集所有字符在pattern中最后一次出现的位置,所以这个字符集不能太大,例如英文字符26个用BM来进行查找就比较合适。代码如下:
public class BoyerMoore {
private final int R; // the radix
private int[] right; // the bad-character skip array
private char[] pattern; // store the pattern as a character array
private String pat; // or as a string
// pattern provided as a string
public BoyerMoore(String pat) {
this.R = 256;
this.pat = pat;
// position of rightmost occurrence of c in the pattern
right = new int[R];
for (int c = 0; c < R; c++)
right[c] = -1;
for (int j = 0; j < pat.length(); j++)
right[pat.charAt(j)] = j;
}
// pattern provided as a character array
public BoyerMoore(char[] pattern, int R) {
this.R = R;
this.pattern = new char[pattern.length];
for (int j = 0; j < pattern.length; j++)
this.pattern[j] = pattern[j];
// position of rightmost occurrence of c in the pattern
right = new int[R];
for (int c = 0; c < R; c++)
right[c] = -1;
for (int j = 0; j < pattern.length; j++)
right[pattern[j]] = j;
}
// return offset of first match; N if no match
public int search(String txt) {
int M = pat.length();
int N = txt.length();
int skip;
for (int i = 0; i <= N - M; i += skip) {
skip = 0;
for (int j = M-1; j >= 0; j--) {
if (pat.charAt(j) != txt.charAt(i+j)) {
skip = Math.max(1, j - right[txt.charAt(i+j)]);
break;
}
}
if (skip == 0) return i; // found
}
return N; // not found
}
// return offset of first match; N if no match
public int search(char[] text) {
int M = pattern.length;
int N = text.length;
int skip;
for (int i = 0; i <= N - M; i += skip) {
skip = 0;
for (int j = M-1; j >= 0; j--) {
if (pattern[j] != text[i+j]) {
skip = Math.max(1, j - right[text[i+j]]);
break;
}
}
if (skip == 0) return i; // found
}
return N; // not found
}
}