BF&KMP&BM简析

一、BF

BF,简单直观也是字符串处理库中使用最多的,在处理数据长度不大的时候效率不会比KMP和BM慢,示例代码如下:

/**
     * Code shared by String and StringBuffer to do searches. The
     * source is the character array being searched, and the target
     * is the string being searched for.
     *
     * @param   source       the characters being searched.
     * @param   sourceOffset offset of the source string.
     * @param   sourceCount  count of the source string.
     * @param   target       the characters being searched for.
     * @param   targetOffset offset of the target string.
     * @param   targetCount  count of the target string.
     * @param   fromIndex    the index to begin searching from.
     */
    static int indexOf(char[] source, int sourceOffset, int sourceCount,
            char[] target, int targetOffset, int targetCount,
            int fromIndex) {
        if (fromIndex >= sourceCount) {
            return (targetCount == 0 ? sourceCount : -1);
        }
        if (fromIndex < 0) {
            fromIndex = 0;
        }
        if (targetCount == 0) {
            return fromIndex;
        }

        char first = target[targetOffset];
        int max = sourceOffset + (sourceCount - targetCount);

        for (int i = sourceOffset + fromIndex; i <= max; i++) {
            /* Look for first character. */
            if (source[i] != first) {
                while (++i <= max && source[i] != first);
            }

            /* Found first character, now look at the rest of v2 */
            if (i <= max) {
                int j = i + 1;
                int end = j + targetCount - 1;
                for (int k = targetOffset + 1; j < end && source[j]
                        == target[k]; j++, k++);

                if (j == end) {
                    /* Found whole string. */
                    return i - sourceOffset;
                }
            }
        }
        return -1;
    }

二、KMP

在BF算法中,每次匹配pattern字符串后是简单的把text的下标回溯,pattern的下标设置为0,这种方法没有充分利用我们已经对比的字符中已知的信息,如果能把已知的信息利用上,就会减少很多重复工作。

KMP算法通过对pattern进行预计算来减少下标回溯的距离,进而加快查找的速度,预计算的数据存储在一个数组中,举例说明:

text:ababcdefgh           pattern:ababd

ababcdefgh

  a babd                BF算法

    aba bd              KMP算法

可以看到当pattern自身有重复的时候,KMP算法利用了这个信息,不需要重新匹配已经匹配的重复字符,匹配算法讲解如下:

对pattern预计算的结果存储在failure[]中,failure[0] = 0,假设failure[i] = k,即text[0 … k-1] == pattern[i-k, i-1],当pattern[i] == pattern[k],则pattern[0 … k] == pattern[i-k, i],进而failure[i+1] = failure[i] + 1 = k + 1;当pattern[i] != pattern[k],模式匹配失败,则k = failure[k]。KMP算法如下:

public class KMPMatch {
  
  private String string;
  private String pattern;
  private int[] failure;
  private int matchPoint;
  
  public KMPMatch(String string, String pattern) {
    this.string = string;
    this.pattern = pattern;
    failure = new int[pattern.length()];
    computeFailure();
  }
      
  public int getMatchPoint() {
    return matchPoint;
  }
  
  
  public boolean match() {
    // Tries to find an occurence of the pattern in the string
    
    int j = 0;
    if (string.length() == 0) return false;
    
    for (int i = 0; i < string.length(); i++) {
      while (j > 0 && pattern.charAt(j) != string.charAt(i)) {
        j = failure[j - 1];
      }
      if (pattern.charAt(j) == string.charAt(i)) { j++; }
      if (j == pattern.length()) {
        matchPoint = i - pattern.length() + 1;
        return true;
      }
    }
    return false;
  }
  
  public boolean match1() {
   
    int i = 0;
    int j = 0;
    if (string.length() == 0) return false;
    
    while (i + pattern.length() - j <= string.length()) {
      if (j >= pattern.length()) {
        matchPoint = i - pattern.length();        
        return true;
      }
      if (string.charAt(i) == pattern.charAt(j)) {
        i++;
        j++;        
      } else {
        if (j > 0) { j = failure[j - 1]; }
        else { i++; }
      }
    }  
    return false;    
  }
  
  /** 
   * Computes the failure function using a boot-strapping process,
   * where the pattern is matched against itself.
   */
  private void computeFailure() {

    int j = 0;
    for (int i = 1; i < pattern.length(); i++) {
      while (j > 0 && pattern.charAt(j) != pattern.charAt(i)) { j = failure[j - 1]; }
      if (pattern.charAt(j) == pattern.charAt(i)) { j++; }
      failure[i] = j;
    }
  }
  
}

根据KMP的pattern的failure生成规则可以看出,kmp比较适合字符集基数较少,pattern中能够出现较多重复规则的情况下使用性能比较好,例如DNA等

三、BM

BM算法有两个要点,

(1)在对比text和pattern的时候,pattern从后往前对比

(2)在从后向前对比pattern和text的时候,c是text中当前对比的字符,如果相等正常向前对比就行;如果不相等,分两种情况处理:

          1)如果c不包含在pattern中,这个时候text的对比索引可以向前进pattern的长度

          2)如果c包含在pattern里,由于是从后向前对比的,所以找出pattern中最右边出现的c,将text的索引向前使text中的c跟pattern中的c对齐

由于要预先计算出text所属字符集所有字符在pattern中最后一次出现的位置,所以这个字符集不能太大,例如英文字符26个用BM来进行查找就比较合适。代码如下:

public class BoyerMoore {
    private final int R;     // the radix
    private int[] right;     // the bad-character skip array

    private char[] pattern;  // store the pattern as a character array
    private String pat;      // or as a string

    // pattern provided as a string
    public BoyerMoore(String pat) {
        this.R = 256;
        this.pat = pat;

        // position of rightmost occurrence of c in the pattern
        right = new int[R];
        for (int c = 0; c < R; c++)
            right[c] = -1;
        for (int j = 0; j < pat.length(); j++)
            right[pat.charAt(j)] = j;
    }

    // pattern provided as a character array
    public BoyerMoore(char[] pattern, int R) {
        this.R = R;
        this.pattern = new char[pattern.length];
        for (int j = 0; j < pattern.length; j++)
            this.pattern[j] = pattern[j];

        // position of rightmost occurrence of c in the pattern
        right = new int[R];
        for (int c = 0; c < R; c++)
            right[c] = -1;
        for (int j = 0; j < pattern.length; j++)
            right[pattern[j]] = j;
    }

    // return offset of first match; N if no match
    public int search(String txt) {
        int M = pat.length();
        int N = txt.length();
        int skip;
        for (int i = 0; i <= N - M; i += skip) {
            skip = 0;
            for (int j = M-1; j >= 0; j--) {
                if (pat.charAt(j) != txt.charAt(i+j)) {
                    skip = Math.max(1, j - right[txt.charAt(i+j)]);
                    break;
                }
            }
            if (skip == 0) return i;    // found
        }
        return N;                       // not found
    }


    // return offset of first match; N if no match
    public int search(char[] text) {
        int M = pattern.length;
        int N = text.length;
        int skip;
        for (int i = 0; i <= N - M; i += skip) {
            skip = 0;
            for (int j = M-1; j >= 0; j--) {
                if (pattern[j] != text[i+j]) {
                    skip = Math.max(1, j - right[text[i+j]]);
                    break;
                }
            }
            if (skip == 0) return i;    // found
        }
        return N;                       // not found
    }
}

转载于:https://my.oschina.net/diwayou/blog/149804

评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值