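Note: this article is largely a translation of EXACT STRING MATCHING ALGORITHMS, with some filler removed and some explanations added.
Every algorithm in this article outputs all matching positions. In the code the pattern is written x[m] and the text y[n]; all strings are built over a finite alphabet Σ of size σ.
3. The Magic of Bit Operations: KR and SO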
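Bit operations can often do surprising things. For example, how do you swap two numbers without a temporary variable? Someone who has never met this kind of problem is very unlikely to find the answer unaided. To borrow an analogy from the game of Go, bit operations are the "tesuji" of programming.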
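The swap puzzle mentioned above is usually answered with three XOR operations. The following is a minimal sketch of that trick; the function name xor_swap is only chosen for illustration.

```c
/* Swap *a and *b without a temporary, using three XORs.
   The guard matters: if a and b point to the same object,
   the first XOR would zero it and the value would be lost. */
void xor_swap(unsigned int *a, unsigned int *b) {
    if (a == b) return;
    *a ^= *b;   /* a = a0 ^ b0 */
    *b ^= *a;   /* b = b0 ^ (a0 ^ b0) = a0 */
    *a ^= *b;   /* a = (a0 ^ b0) ^ a0 = b0 */
}
```

In everyday code a plain temporary is normally just as fast and easier to read; the XOR version is mainly interesting as a puzzle.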
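Storing data bit by bit gives the best possible use of storage space, and even though the data is compressed, operations on it become faster because the CPU supports them directly in hardware. For example, shifting an ordinary array takes O(n) time, while a bitwise shift on a machine word is a single instruction.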
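To make the comparison concrete, here is a small sketch that stores the same flags in two ways: one flag per array element, and one flag per bit of a machine word. Both function names are made up for this illustration, and index 0 of the array plays the role of the lowest bit.

```c
/* One flag per array element: shifting costs a loop over all n slots. */
void shift_left_array(unsigned char bits[], int n) {
    int i;
    for (i = n - 1; i > 0; --i)
        bits[i] = bits[i - 1];
    if (n > 0)
        bits[0] = 0;
}

/* One flag per bit of a machine word: the same shift is one operation. */
unsigned int shift_left_word(unsigned int bits) {
    return bits << 1;
}
```

The word version only works while the number of flags fits into one word, which is the same restriction the Shift-Or algorithm later in this article has on the pattern length.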
KR algorithm
Karp-Rabin algorithm
Main features:
- uses a hashing function;
- preprocessing phase in O(m) time complexity and constant space;
- searching phase in O(mn) time complexity;
- O(n+m) expected running time.
Hashing provides a simple method to avoid a quadratic number of character comparisons in most practical situations. Instead of checking at each position of the text whether the pattern occurs, it seems to be more efficient to check only whether the contents of the window "look like" the pattern. To check the resemblance between these two words, a hashing function is used.
To be helpful for the string matching problem, a hashing function hash should have the following properties:
- efficiently computable;
- highly discriminating for strings;
- hash(y[j+1 .. j+m]) must be easily computable from hash(y[j .. j+m-1]) and y[j+m]:
  hash(y[j+1 .. j+m]) = rehash(y[j], y[j+m], hash(y[j .. j+m-1])).
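For a word w of length m let hash(w) be defined as follows:
$$\mathrm{hash}(w[0 \,..\, m-1]) = \left(w[0] \cdot 2^{m-1} + w[1] \cdot 2^{m-2} + \cdots + w[m-1] \cdot 2^{0}\right) \bmod q$$
where q is a large number.
Then,
$$\mathrm{rehash}(a, b, h) = \left((h - a \cdot 2^{m-1}) \cdot 2 + b\right) \bmod q$$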
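The two formulas can be checked against each other with a short sketch. The modulus Q below is an assumed value picked only for the illustration (the text just asks for "a large number"), and the names kr_hash, kr_rehash and top_weight are hypothetical helpers, not part of the original code.

```c
#include <stdint.h>

#define Q 1000000007u   /* assumed modulus, chosen only for this sketch */

/* 2^(m-1) mod Q, the weight of the leading character of a window */
static uint64_t top_weight(int m) {
    uint64_t d = 1;
    int i;
    for (i = 1; i < m; ++i)
        d = (d * 2) % Q;
    return d;
}

/* hash(w[0 .. m-1]) evaluated with Horner's rule */
uint64_t kr_hash(const char *w, int m) {
    uint64_t h = 0;
    int i;
    for (i = 0; i < m; ++i)
        h = (h * 2 + (unsigned char)w[i]) % Q;
    return h;
}

/* rehash(a, b, h): remove the leading character a, shift, append b.
   Q is added before subtracting so the intermediate never goes negative. */
uint64_t kr_rehash(char a, char b, uint64_t h, int m) {
    uint64_t drop = ((unsigned char)a * top_weight(m)) % Q;
    return (((h + Q - drop) % Q) * 2 + (unsigned char)b) % Q;
}
```

For instance, kr_rehash('G', 'G', kr_hash("GCATCGCA", 8), 8) gives the same value as kr_hash("CATCGCAG", 8), which is exactly the sliding-window property required above. A real implementation would compute the weight 2^(m-1) once during preprocessing instead of on every call.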
The preprocessing phase of the Karp-Rabin algorithm consists in computing hash(x). It can be done in constant space and O(m) time.
During the searching phase, it is enough to compare hash(x) with hash(y[j .. j+m-1]) for 0 ≤ j ≤ n-m. If an equality is found, it is still necessary to check the equality x = y[j .. j+m-1] character by character.
The time complexity of the searching phase of the Karp-Rabin algorithm is O(mn) (when searching for a^m in a^n, for instance). Its expected number of text character comparisons is O(n+m).
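#include <stdio.h>
#include <string.h>

#define OUTPUT(j) printf("%d\n", (j))
/* Rolling update: drop a (weighted by d = 2^(m-1)), shift, append b.
   Unsigned arithmetic wraps around, so q is implicitly 2^(word size). */
#define REHASH(a, b, h) ((((h) - (a)*d) << 1) + (b))

void KR(const char *x, int m, const char *y, int n) {
   unsigned int d, hx, hy;
   int i, j;

   if (m <= 0 || m > n)
      return;

   /* Preprocessing */
   /* computes d = 2^(m-1) with the left-shift operator */
   for (d = 1, i = 1; i < m; ++i)
      d = (d<<1);

   for (hy = hx = 0, i = 0; i < m; ++i) {
      hx = ((hx<<1) + (unsigned char)x[i]);
      hy = ((hy<<1) + (unsigned char)y[i]);
   }

   /* Searching */
   j = 0;
   while (j <= n-m) {
      if (hx == hy && memcmp(x, y + j, m) == 0)
         OUTPUT(j);
      if (j < n-m)   /* do not read y[n] after the last window */
         hy = REHASH((unsigned char)y[j], (unsigned char)y[j + m], hy);
      ++j;
   }
}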
Example: searching for x = GCAGAGAG (m = 8) in y = GCATCGCAGAGAGTATACAGTACG (n = 24). The hash value of each window is:
window | hash
y[0 .. 7] | 17819
y[1 .. 8] | 17533
y[2 .. 9] | 17979
y[3 .. 10] | 19389
y[4 .. 11] | 17339
y[5 .. 12] | 17597
y[6 .. 13] | 17102
y[7 .. 14] | 17117
y[8 .. 15] | 17678
y[9 .. 16] | 17245
y[10 .. 17] | 17917
y[11 .. 18] | 17723
y[12 .. 19] | 18877
y[13 .. 20] | 19662
y[14 .. 21] | 17885
y[15 .. 22] | 19197
y[16 .. 23] | 16961
Only hash(y[5 .. 12]) = 17597 equals hash(x), so only that window is checked character by character (comparisons 1 to 8), and it matches.
The Karp-Rabin algorithm performs 8 character comparisons on the example.
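The comparison count can be reproduced with a small instrumented variant of the algorithm. This is only a sketch: kr_count is a hypothetical helper, it verifies candidates with an explicit loop so that comparisons can be counted, and like the code above it relies on unsigned wrap-around instead of an explicit modulus q.

```c
/* Hypothetical helper: runs Karp-Rabin and returns the number of character
   comparisons. Match positions are stored in out (at most max_out of them)
   and *found receives how many positions were stored. */
int kr_count(const char *x, int m, const char *y, int n,
             int out[], int max_out, int *found) {
    unsigned int d = 1, hx = 0, hy = 0;
    int i, j, k, cmp = 0;

    *found = 0;
    if (m <= 0 || m > n)
        return 0;
    for (i = 1; i < m; ++i)
        d <<= 1;                              /* d = 2^(m-1) */
    for (i = 0; i < m; ++i) {
        hx = (hx << 1) + (unsigned char)x[i];
        hy = (hy << 1) + (unsigned char)y[i];
    }
    for (j = 0; j <= n - m; ++j) {
        if (hx == hy) {                       /* candidate window: verify it */
            for (k = 0; k < m; ++k) {
                ++cmp;
                if (x[k] != y[j + k])
                    break;
            }
            if (k == m && *found < max_out)
                out[(*found)++] = j;
        }
        if (j < n - m)                        /* roll to the next window */
            hy = ((hy - (unsigned char)y[j] * d) << 1) + (unsigned char)y[j + m];
    }
    return cmp;
}
```

Called with the pattern GCAGAGAG and the 24-character example text, it reports one match at position 5 after 8 comparisons. Called with aaa and a text of ten a's, every one of the 8 windows is a match and costs 3 comparisons, which illustrates the O(mn) worst case.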
Shift-Or algorithm
Main features:
- uses bitwise techniques;
- efficient if the pattern length is no longer than the memory-word size of the machine;
- preprocessing phase in O(m + σ) time and space complexity;
- searching phase in O(n) time complexity (independent from the alphabet size and the pattern length);
- adapts easily to approximate string matching.
#include <stdio.h>

#define WORDSIZE ((int)(sizeof(unsigned int)*8))
#define ASIZE 256
#define OUTPUT(j) printf("%d\n", (j))

/* Builds S: bit i of S[c] is 0 iff x[i] == c. Returns the match threshold. */
unsigned int preSo(const char *x, int m, unsigned int S[]) {
   unsigned int j, lim;
   int i;
   for (i = 0; i < ASIZE; ++i)
      S[i] = ~0u;
   for (lim = 0, i = 0, j = 1; i < m; ++i, j <<= 1) {
      /* index through unsigned char: plain char may be negative */
      S[(unsigned char)x[i]] &= ~j;
      lim |= j;
   }
   lim = ~(lim>>1);
   return(lim);
}

void SO(const char *x, int m, const char *y, int n) {
   unsigned int lim, state;
   unsigned int S[ASIZE];
   int j;
   if (m <= 0 || m > WORDSIZE) {
      fprintf(stderr, "SO: Use pattern size <= word size\n");
      return;
   }

   /* Preprocessing */
   lim = preSo(x, m, S);

   /* Searching */
   for (state = ~0u, j = 0; j < n; ++j) {
      state = (state<<1) | S[(unsigned char)y[j]];
      if (state < lim)
         OUTPUT(j - m + 1);
   }
}
Example:
As R12[7] = 0, an occurrence of x has been found at position 12 - 8 + 1 = 5.
After the second for loop in preSo, lim = 2^m - 1, which is 11111111 in binary for m = 8. Then lim = ~(lim>>1) = 10000000. Only the low 8 bits are shown here and below; in a wider unsigned int all higher bits are 1.
The first for loop in preSo initializes the position mask S[c] of every character c to all ones.
The second for loop in preSo clears bit i of S[x[i]]:
i = 0, j = 1 = 00000001, S[x[i]] = S[G] = S[G] & ~j = 11111110, lim = 00000001;
i = 1, j = 2 = 00000010, S[x[i]] = S[C] = S[C] & ~j = 11111101, lim = 00000011;
i = 2, j = 4 = 00000100, S[x[i]] = S[A] = S[A] & ~j = 11111011, lim = 00000111;
i = 3, j = 8 = 00001000, S[x[i]] = S[G] = S[G] & ~j = 11110110, lim = 00001111;
i = 4, j = 16 = 00010000, S[x[i]] = S[A] = S[A] & ~j = 11101011, lim = 00011111;
i = 5, j = 32 = 00100000, S[x[i]] = S[G] = S[G] & ~j = 11010110, lim = 00111111;
i = 6, j = 64 = 01000000, S[x[i]] = S[A] = S[A] & ~j = 10101011, lim = 01111111;
i = 7, j = 128 = 10000000, S[x[i]] = S[G] = S[G] & ~j = 01010110, lim = 11111111;
Finally:
S[A] = 10101011
S[C] = 11111101
S[G] = 01010110
All other entries are all ones.
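The final masks can also be read straight off the pattern, without stepping through the loop: bit i of S[c] is 0 exactly when x[i] equals c. The sketch below writes that rule down directly; mask_bits is a hypothetical helper that only produces the textual form used in the listing above.

```c
/* Writes the low m bits of S[c] as text, highest bit first:
   bit i is '0' exactly when x[i] == c, otherwise '1'.
   out must have room for m + 1 characters. */
void mask_bits(const char *x, int m, char c, char out[]) {
    int i;
    for (i = 0; i < m; ++i)
        out[m - 1 - i] = (x[i] == c) ? '0' : '1';
    out[m] = '\0';
}
```

For the pattern GCAGAGAG this yields 10101011 for A, 11111101 for C and 01010110 for G, and 11111111 for any character that does not occur in the pattern, such as T.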
The value of state in the for loop of SO corresponds to one column of the figure above. Step by step, where the left operand of | is state<<1:
j = 0, state = 11111110 | S[G] = 11111110;
j = 1, state = 11111100 | S[C] = 11111101;
j = 2, state = 11111010 | S[A] = 11111011;
j = 3, state = 11110110 | S[T] = 11111111;
j = 4, state = 11111110 | S[C] = 11111111;
j = 5, state = 11111110 | S[G] = 11111110;
j = 6, state = 11111100 | S[C] = 11111101;
j = 7, state = 11111010 | S[A] = 11111011;
j = 8, state = 11110110 | S[G] = 11110110;
j = 9, state = 11101100 | S[A] = 11101111;
j = 10, state = 11011110 | S[G] = 11011110;
j = 11, state = 10111100 | S[A] = 10111111;
j = 12, state = 01111110 | S[G] = 01111110;
...
In the figure above, 0 is the lowest bit and 7 is the highest. Only at j = 12 does the highest bit become 0, so only then is state less than lim.
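The trace can be replayed with a variant of the search that records each state instead of printing matches. This is a sketch under the assumption that 1 <= m <= the number of bits in an unsigned int; so_search and its trace parameter are hypothetical names introduced only for this illustration.

```c
#define SO_ASIZE 256

/* Sketch of Shift-Or that records results instead of printing them.
   Stores match start positions in out (at most max_out) and returns how
   many were stored. If trace is not a null pointer, trace[j] receives
   the state after reading y[j]. */
int so_search(const char *x, int m, const char *y, int n,
              int out[], int max_out, unsigned int trace[]) {
    unsigned int S[SO_ASIZE], state = ~0u, bit = 1, lim = 0;
    int i, j, found = 0;

    /* Preprocessing: bit i of S[c] is 0 iff x[i] == c */
    for (i = 0; i < SO_ASIZE; ++i)
        S[i] = ~0u;
    for (i = 0; i < m; ++i, bit <<= 1) {
        S[(unsigned char)x[i]] &= ~bit;
        lim |= bit;
    }
    lim = ~(lim >> 1);

    /* Searching: a zero in bit m-1 means the whole pattern just matched */
    for (j = 0; j < n; ++j) {
        state = (state << 1) | S[(unsigned char)y[j]];
        if (trace)
            trace[j] = state;
        if (state < lim && found < max_out)
            out[found++] = j - m + 1;
    }
    return found;
}
```

Run on the pattern GCAGAGAG and the example text, the low 8 bits of trace[12] are 01111110 and the single reported position is 12 - 8 + 1 = 5, in agreement with the listing above.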
References: