A Brief Analysis of BF, KMP, and BM

License: CC 4.0 BY-SA

Original link: https://my.oschina.net/diwayou/blog/149804


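1. BF

BF (brute force) is simple and intuitive, and it is the approach most commonly used in string-processing libraries. When the data being processed is short, it is no slower than KMP or BM. The example below is the indexOf routine shared by String and StringBuffer in the JDK: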

/**
     * Code shared by String and StringBuffer to do searches. The
     * source is the character array being searched, and the target
     * is the string being searched for.
     *
     * @param   source       the characters being searched.
     * @param   sourceOffset offset of the source string.
     * @param   sourceCount  count of the source string.
     * @param   target       the characters being searched for.
     * @param   targetOffset offset of the target string.
     * @param   targetCount  count of the target string.
     * @param   fromIndex    the index to begin searching from.
     */
    static int indexOf(char[] source, int sourceOffset, int sourceCount,
            char[] target, int targetOffset, int targetCount,
            int fromIndex) {
        if (fromIndex >= sourceCount) {
            return (targetCount == 0 ? sourceCount : -1);
        }
        if (fromIndex < 0) {
            fromIndex = 0;
        }
        if (targetCount == 0) {
            return fromIndex;
        }

        char first = target[targetOffset];
        int max = sourceOffset + (sourceCount - targetCount);

        for (int i = sourceOffset + fromIndex; i <= max; i++) {
            /* Look for first character. */
            if (source[i] != first) {
                while (++i <= max && source[i] != first);
            }

            /* Found first character, now look at the rest of v2 */
            if (i <= max) {
                int j = i + 1;
                int end = j + targetCount - 1;
                for (int k = targetOffset + 1; j < end && source[j]
                        == target[k]; j++, k++);

                if (j == end) {
                    /* Found whole string. */
                    return i - sourceOffset;
                }
            }
        }
        return -1;
    }
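
The routine above carries extra offset and count bookkeeping. As a rough illustration of the bare brute-force idea, here is a minimal sketch; the class name NaiveSearch and its indexOf method are hypothetical and are not part of the JDK.

```java
// A minimal brute-force sketch; NaiveSearch is a hypothetical name.
class NaiveSearch {

    /** Returns the index of the first occurrence of pattern in text, or -1. */
    static int indexOf(String text, String pattern) {
        int n = text.length();
        int m = pattern.length();
        // Try every alignment of pattern against text.
        for (int i = 0; i <= n - m; i++) {
            int j = 0;
            // Compare left to right until a mismatch or the end of pattern.
            while (j < m && text.charAt(i + j) == pattern.charAt(j)) {
                j++;
            }
            if (j == m) {
                return i; // every character of pattern matched
            }
            // On a mismatch the text position falls back to i + 1
            // and the pattern position is reset to 0.
        }
        return -1;
    }

    public static void main(String[] args) {
        System.out.println(indexOf("ababcdefgh", "abcd"));  // 2
        System.out.println(indexOf("ababcdefgh", "ababd")); // -1
    }
}
```

The worst case is O(n * m) character comparisons, but for short inputs the absence of any preprocessing usually makes up for it.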

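2. KMP

In the BF algorithm, after each failed attempt to match pattern, the text index is simply moved back and the pattern index is reset to 0. This makes no real use of what the characters already compared have told us; if that information is put to use, a lot of repeated work can be avoided.

The KMP algorithm preprocesses pattern to reduce how far the indexes have to move back, which speeds up the search. The precomputed data is stored in an array. An example: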

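text: ababcdefgh, pattern: ababd

ababcdefgh
ababd        first alignment: mismatch on the fifth character (c vs d)
 ababd       BF: shift by one and compare again from the start of pattern
  ababd      KMP: shift by two; the leading "ab" is already known to match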

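As the example shows, when pattern contains repetition, KMP exploits it and does not re-compare the repeated characters that have already been matched. The algorithm works as follows:

The result of preprocessing pattern is stored in failure[], where failure[i] is the length of the longest proper prefix of pattern[0 … i] that is also a suffix of it, and failure[0] = 0. Suppose failure[i] = k, i.e. pattern[0 … k-1] == pattern[i-k+1 … i]. If pattern[i+1] == pattern[k], then pattern[0 … k] == pattern[i-k+1 … i+1], so failure[i+1] = failure[i] + 1 = k + 1. If pattern[i+1] != pattern[k], the extension fails, so set k = failure[k-1] and compare again; if k reaches 0 and the characters still differ, failure[i+1] = 0. The KMP algorithm is as follows: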

public class KMPMatch {
  
  private String string;
  private String pattern;
  private int[] failure;
  private int matchPoint;
  
  public KMPMatch(String string, String pattern) {
    this.string = string;
    this.pattern = pattern;
    failure = new int[pattern.length()];
    computeFailure();
  }
      
  public int getMatchPoint() {
    return matchPoint;
  }
  
  
  public boolean match() {
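    // Tries to find an occurrence of the pattern in the string
    
    int j = 0;
    if (string.length() == 0) return false;
    // An empty pattern matches at offset 0 (avoids charAt(0) on an empty pattern)
    if (pattern.length() == 0) { matchPoint = 0; return true; }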
    
    for (int i = 0; i < string.length(); i++) {
      while (j > 0 && pattern.charAt(j) != string.charAt(i)) {
        j = failure[j - 1];
      }
      if (pattern.charAt(j) == string.charAt(i)) { j++; }
      if (j == pattern.length()) {
        matchPoint = i - pattern.length() + 1;
        return true;
      }
    }
    return false;
  }
  
  public boolean match1() {
   
    int i = 0;
    int j = 0;
    if (string.length() == 0) return false;
    
    while (i + pattern.length() - j <= string.length()) {
      if (j >= pattern.length()) {
        matchPoint = i - pattern.length();        
        return true;
      }
      if (string.charAt(i) == pattern.charAt(j)) {
        i++;
        j++;        
      } else {
        if (j > 0) { j = failure[j - 1]; }
        else { i++; }
      }
    }  
    return false;    
  }
  
  /** 
   * Computes the failure function using a boot-strapping process,
   * where the pattern is matched against itself.
   */
  private void computeFailure() {

    int j = 0;
    for (int i = 1; i < pattern.length(); i++) {
      while (j > 0 && pattern.charAt(j) != pattern.charAt(i)) { j = failure[j - 1]; }
      if (pattern.charAt(j) == pattern.charAt(i)) { j++; }
      failure[i] = j;
    }
  }
  
}
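
To make the failure array concrete, the following standalone sketch computes it for the pattern ababd from the example above and runs a search with it. The class FailureDemo and its method names are hypothetical; the logic mirrors computeFailure() and match() but is not the author's class.

```java
import java.util.Arrays;

// Standalone sketch; FailureDemo and its methods are hypothetical names.
class FailureDemo {

    /** failure[i] = length of the longest proper prefix of p[0..i] that is also a suffix of it. */
    static int[] failure(String p) {
        int[] f = new int[p.length()];
        int j = 0;
        for (int i = 1; i < p.length(); i++) {
            // Fall back to shorter borders until p[j] can extend one of them.
            while (j > 0 && p.charAt(j) != p.charAt(i)) {
                j = f[j - 1];
            }
            if (p.charAt(j) == p.charAt(i)) {
                j++;
            }
            f[i] = j;
        }
        return f;
    }

    /** Returns the index of the first match of p in text, or -1. */
    static int search(String text, String p) {
        if (p.isEmpty()) {
            return 0;
        }
        int[] f = failure(p);
        int j = 0; // number of pattern characters currently matched
        for (int i = 0; i < text.length(); i++) {
            // The text index i never moves backwards; only j falls back.
            while (j > 0 && p.charAt(j) != text.charAt(i)) {
                j = f[j - 1];
            }
            if (p.charAt(j) == text.charAt(i)) {
                j++;
            }
            if (j == p.length()) {
                return i - p.length() + 1;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(failure("ababd"))); // [0, 0, 1, 2, 0]
        System.out.println(search("abababd", "ababd"));        // 2
        System.out.println(search("ababcdefgh", "ababd"));     // -1
    }
}
```

For ababd, a mismatch on the fifth character gives j = failure[3] = 2, which is exactly the shift by two shown in the alignment example.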

From the way the failure array is built for pattern, KMP performs best when the alphabet is small and pattern contains many repeated fragments, for example DNA sequences.

3. BM

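The BM algorithm has two key points:

(1) When comparing text with pattern, pattern is compared from back to front.

(2) While comparing pattern and text from back to front, let c be the character of text currently being compared. If the characters are equal, simply keep comparing toward the front; if they differ, there are two cases:

1) If c does not occur in pattern, the whole pattern can move past c: the alignment advances by j + 1, where j is the pattern index of the mismatch (the full length of pattern when the mismatch is on its last character).

2) If c occurs in pattern, then, because the comparison runs from back to front, find the rightmost occurrence of c in pattern and advance the alignment so that the c in text lines up with that c in pattern. If that occurrence lies to the right of the mismatch position, the computed shift would not be positive, so the alignment advances by 1 instead.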
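
The two cases above can be worked through with a small sketch. The class BadCharShift and its methods are hypothetical helpers written for illustration; they compute only the shift for one mismatch, not a full search.

```java
// Sketch of the shift rule only; BadCharShift is a hypothetical name.
class BadCharShift {

    /** Index of the rightmost occurrence of c in pattern, or -1 if c is absent. */
    static int rightmost(String pattern, char c) {
        return pattern.lastIndexOf(c);
    }

    /**
     * Shift for a mismatch at pattern index j against text character c.
     * The result is never smaller than 1, so the alignment always advances.
     */
    static int shift(String pattern, int j, char c) {
        return Math.max(1, j - rightmost(pattern, c));
    }

    public static void main(String[] args) {
        String pattern = "ababd";
        // Case 1: 'x' is not in the pattern, mismatch on the last character.
        System.out.println(shift(pattern, 4, 'x')); // 5
        // Case 2: rightmost 'a' is at index 2, so align it with the text's 'a'.
        System.out.println(shift(pattern, 4, 'a')); // 2
        // Rightmost 'd' (index 4) is right of the mismatch at index 1: move by 1.
        System.out.println(shift(pattern, 1, 'd')); // 1
    }
}
```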

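Because the last position in pattern must be precomputed for every character of the alphabet that text is drawn from, this alphabet cannot be too large; for example, the 26 English letters are well suited to searching with BM. The code is as follows: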

public class BoyerMoore {
    private final int R;     // the radix
    private int[] right;     // the bad-character skip array

    private char[] pattern;  // store the pattern as a character array
    private String pat;      // or as a string

    // pattern provided as a string
    public BoyerMoore(String pat) {
        this.R = 256;
        this.pat = pat;

        // position of rightmost occurrence of c in the pattern
        right = new int[R];
        for (int c = 0; c < R; c++)
            right[c] = -1;
        for (int j = 0; j < pat.length(); j++)
            right[pat.charAt(j)] = j;
    }

    // pattern provided as a character array
    public BoyerMoore(char[] pattern, int R) {
        this.R = R;
        this.pattern = new char[pattern.length];
        for (int j = 0; j < pattern.length; j++)
            this.pattern[j] = pattern[j];

        // position of rightmost occurrence of c in the pattern
        right = new int[R];
        for (int c = 0; c < R; c++)
            right[c] = -1;
        for (int j = 0; j < pattern.length; j++)
            right[pattern[j]] = j;
    }

    // return offset of first match; N if no match
    public int search(String txt) {
        int M = pat.length();
        int N = txt.length();
        int skip;
        for (int i = 0; i <= N - M; i += skip) {
            skip = 0;
            for (int j = M-1; j >= 0; j--) {
                if (pat.charAt(j) != txt.charAt(i+j)) {
                    skip = Math.max(1, j - right[txt.charAt(i+j)]);
                    break;
                }
            }
            if (skip == 0) return i;    // found
        }
        return N;                       // not found
    }


    // return offset of first match; N if no match
    public int search(char[] text) {
        int M = pattern.length;
        int N = text.length;
        int skip;
        for (int i = 0; i <= N - M; i += skip) {
            skip = 0;
            for (int j = M-1; j >= 0; j--) {
                if (pattern[j] != text[i+j]) {
                    skip = Math.max(1, j - right[text[i+j]]);
                    break;
                }
            }
            if (skip == 0) return i;    // found
        }
        return N;                       // not found
    }
}
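
The class above indexes right[] directly by character, so with R = 256 any character of pattern or text with a code of 256 or more falls outside the table. As one possible variation (an assumption for illustration, not part of the original code), the table can be replaced by a HashMap so that only characters actually present in pattern are stored. MapBoyerMoore is a hypothetical name, and unlike the class above it returns -1 rather than N when nothing is found.

```java
import java.util.HashMap;
import java.util.Map;

// Bad-character rule only, with a map instead of an int[R] table.
// MapBoyerMoore is a hypothetical name used for this sketch.
class MapBoyerMoore {
    private final String pat;
    // Rightmost index in pat of each character that occurs in it.
    private final Map<Character, Integer> right = new HashMap<>();

    MapBoyerMoore(String pat) {
        this.pat = pat;
        for (int j = 0; j < pat.length(); j++) {
            right.put(pat.charAt(j), j); // later occurrences overwrite earlier ones
        }
    }

    /** Returns the offset of the first match in txt, or -1 if there is none. */
    int search(String txt) {
        int m = pat.length();
        int n = txt.length();
        int skip;
        for (int i = 0; i <= n - m; i += skip) {
            skip = 0;
            // Compare the pattern from back to front.
            for (int j = m - 1; j >= 0; j--) {
                char c = txt.charAt(i + j);
                if (pat.charAt(j) != c) {
                    // Characters absent from pat behave as if their index were -1.
                    skip = Math.max(1, j - right.getOrDefault(c, -1));
                    break;
                }
            }
            if (skip == 0) {
                return i; // found
            }
        }
        return -1; // not found
    }

    public static void main(String[] args) {
        MapBoyerMoore english = new MapBoyerMoore("needle");
        System.out.println(english.search("find the needle in the haystack"));
        // Characters with codes above 255 work as well.
        MapBoyerMoore cjk = new MapBoyerMoore("\u7b97\u6cd5");
        System.out.println(cjk.search("BM\u7b97\u6cd5"));
    }
}
```

The trade-off is a hash lookup per mismatch instead of an array access, which matters little unless the search is on a very hot path.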
