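The idea behind the Sunday algorithm
The Sunday algorithm is very close in spirit to the bad-character rule of the BM (Boyer-Moore) algorithm.
The only difference is that, after a mismatch, Sunday takes the character of the target string that immediately follows the part currently aligned with the pattern, and uses that character for the bad-character lookup.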
Consider the following alignment:
Index:   01234567890
Target:  abcdefghijk
Pattern: bxcd
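BM compares from right to left, so d and c match and the first mismatch is between b in the target and x in the pattern. The bad character is therefore b (index 1). BM looks for b in the pattern, lines that occurrence up with the bad character, and continues matching from the position below.
Target:  abcdefghijk
Pattern:  bxcd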
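In the Sunday algorithm, when the same alignment fails, we instead take the character of the target that immediately follows the part aligned with the pattern, namely e (index 4), and use e for the bad-character lookup.
Since e does not occur in the pattern, Sunday moves the pattern past it entirely and continues matching at the position below.
Target:  abcdefghijk
Pattern:      bxcd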
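To make the two shifts concrete, here is a minimal C++ sketch that computes both for a given alignment. The names `last_index`, `sunday_shift`, and `bm_bad_char_shift` are made up for this illustration, and the BM function covers only a simplified bad-character rule (the good-suffix rule is left out).

```cpp
#include <cassert>
#include <string>

// Last index of character c in pat, or -1 if c does not occur (hypothetical helper).
int last_index(const std::string& pat, char c) {
    for (int i = static_cast<int>(pat.size()) - 1; i >= 0; --i) {
        if (pat[i] == c) return i;
    }
    return -1;
}

// Sunday shift for the window starting at pos: look at the text character
// just past the window and align its last occurrence in the pattern with it.
// Precondition: that character exists, i.e. pos + pat.size() < text.size().
int sunday_shift(const std::string& text, const std::string& pat, int pos) {
    const int m = static_cast<int>(pat.size());
    assert(pos >= 0 && pos + m < static_cast<int>(text.size()));
    return m - last_index(pat, text[pos + m]);
}

// Simplified BM bad-character shift for a mismatch at pattern index j:
// align the last occurrence of the mismatched text character, moving at least 1.
int bm_bad_char_shift(const std::string& text, const std::string& pat,
                      int pos, int j) {
    const int k = last_index(pat, text[pos + j]);
    return (j - k > 0) ? (j - k) : 1;
}
```

For the example above, `sunday_shift("abcdefghijk", "bxcd", 0)` gives 5, while `bm_bad_char_shift("abcdefghijk", "bxcd", 0, 1)` gives 1.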
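In this example Sunday shifts further than BM (5 positions instead of 1). In general its largest possible shift is the pattern length plus one, so it is often faster than BM in practice, though not in every case. The worst-case time complexity is still O(target length × pattern length). Consider the target string baaaabaaaabaaaabaaaa and the pattern aaaaa, which clearly has no match. With the Sunday algorithm the bad character is a most of the time, and the pattern consists entirely of a, so after most mismatches the pattern can only move right by 1. KMP, by contrast, still guarantees that the search finishes in linear time.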
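The worst case can be checked by counting comparisons. The sketch below is an instrumented left-to-right Sunday search. `SearchStats` and `sunday_count` are hypothetical names chosen for this illustration, and the code assumes a non-empty pattern of single-byte characters.

```cpp
#include <cassert>
#include <string>
#include <vector>

struct SearchStats {
    int position;      // index of the first match, or -1 if there is none
    long comparisons;  // number of character comparisons performed
};

// Left-to-right Sunday search that also counts character comparisons.
SearchStats sunday_count(const std::string& text, const std::string& pat) {
    const int n = static_cast<int>(text.size());
    const int m = static_cast<int>(pat.size());
    assert(m > 0);

    // last[c] = last index of byte c in the pattern, or -1 if absent.
    std::vector<int> last(256, -1);
    for (int i = 0; i < m; ++i) {
        last[static_cast<unsigned char>(pat[i])] = i;
    }

    SearchStats stats{-1, 0};
    int pos = 0;
    while (pos <= n - m) {
        int j = 0;
        while (j < m) {
            ++stats.comparisons;
            if (text[pos + j] != pat[j]) break;
            ++j;
        }
        if (j == m) {
            stats.position = pos;
            return stats;
        }
        if (pos + m >= n) break;  // no character after the window
        pos += m - last[static_cast<unsigned char>(text[pos + m])];
    }
    return stats;
}
```

Searching baaaabaaaabaaaabaaaa for aaaaa this way visits window positions 0, 6, 7, 8, 9, and 10 and performs 16 comparisons before reporting no match. Four of those positions are reached by a shift of just 1.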
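In addition, the Sunday algorithm does not have to compare strictly from left to right or from right to left, because the shift after a mismatch depends only on the next character of the target that has not been compared yet, not on where the mismatch occurred.
We can therefore count in advance how often each character occurs in the pattern and always compare first at the position of the least frequent character. A mismatch is then more likely to show up early, which reduces the number of comparisons and speeds up matching.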
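For example:
Target:  abcdefghijk
Pattern: aabcc
In the pattern, b occurs only once while a and c each occur twice, so we can compare the position of b first (judging only by the characters in the pattern, b is the most likely to mismatch).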
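One possible way to realize this idea is sketched below: sort the pattern positions by how often their character occurs in the pattern, then compare each window in that order. `rarest_first_order` and `sunday_search_ordered` are illustrative names. Counting frequencies in the pattern alone is an assumption taken from the example, and statistics from the expected text could be substituted.

```cpp
#include <algorithm>
#include <cassert>
#include <numeric>
#include <string>
#include <vector>

// Pattern indices ordered so that positions holding the rarest pattern
// characters come first (ties keep their left-to-right order).
std::vector<int> rarest_first_order(const std::string& pat) {
    std::vector<int> count(256, 0);
    for (unsigned char c : pat) ++count[c];

    std::vector<int> order(pat.size());
    std::iota(order.begin(), order.end(), 0);
    std::stable_sort(order.begin(), order.end(), [&](int a, int b) {
        return count[static_cast<unsigned char>(pat[a])] <
               count[static_cast<unsigned char>(pat[b])];
    });
    return order;
}

// Sunday search that compares the positions of each window in that order.
int sunday_search_ordered(const std::string& text, const std::string& pat) {
    const int n = static_cast<int>(text.size());
    const int m = static_cast<int>(pat.size());
    assert(m > 0);

    // last[c] = last index of byte c in the pattern, or -1 if absent.
    std::vector<int> last(256, -1);
    for (int i = 0; i < m; ++i) {
        last[static_cast<unsigned char>(pat[i])] = i;
    }
    const std::vector<int> order = rarest_first_order(pat);

    for (int pos = 0; pos <= n - m; ) {
        int k = 0;
        while (k < m && text[pos + order[k]] == pat[order[k]]) ++k;
        if (k == m) return pos;
        if (pos + m >= n) break;  // no character after the window
        pos += m - last[static_cast<unsigned char>(text[pos + m])];
    }
    return -1;
}
```

For the pattern aabcc the order is 2, 0, 1, 3, 4, so the position of b is compared first in every window.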
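In short, the Sunday algorithm is simple and easy to understand, and it steps outside the conventional way of thinking about matching. In probabilistic terms, when matching random strings it is often faster than other matching algorithms.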
The complete Sunday algorithm
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

// Fill pArray so that pArray[c] is the last index of byte c in the pattern,
// or -1 if c does not occur in it.
bool BadChar(const char *pattern, int nLen, int *pArray, int nArrayLen)
{
    if (nArrayLen < 256)
    {
        return false;
    }
    for (int i = 0; i < 256; i++)
    {
        pArray[i] = -1;
    }
    for (int i = 0; i < nLen; i++)
    {
        // Index as unsigned char: a plain char may be negative for bytes >= 0x80.
        pArray[(unsigned char)pattern[i]] = i;
    }
    return true;
}

// Return the index of the first occurrence of pattern in dest, or -1 if none.
int SundaySearch(const char *dest, int nDLen,
                 const char *pattern, int nPLen,
                 int *pArray)
{
    if (0 == nPLen)
    {
        return -1;
    }
    for (int nBegin = 0; nBegin <= nDLen - nPLen; )
    {
        int j = 0;
        while (j < nPLen && dest[nBegin + j] == pattern[j])
        {
            j++;
        }
        if (j == nPLen)
        {
            return nBegin;
        }
        // The window already ends at the last character of dest, so there is
        // no following character to look up and no further position to try.
        if (nBegin + nPLen >= nDLen)
        {
            return -1;
        }
        // Shift so that the character just past the window lines up with its
        // last occurrence in the pattern (nPLen + 1 if it does not occur).
        nBegin += nPLen - pArray[(unsigned char)dest[nBegin + nPLen]];
    }
    return -1;
}

void TestSundaySearch(void)
{
    int nFind;
    int nBadArray[256] = {0};
    //                             1         2
    //                   0123456789012345678901234567
    const char dest[] = "abcxxxbaaaabaaaxbbaaabcdamno";
    const char pattern[][40] = {
        "a",
        "ab",
        "abc",
        "abcd",
        "x",
        "xx",
        "xxx",
        "ax",
        "axb",
        "xb",
        "b",
        "m",
        "mn",
        "mno",
        "no",
        "o",
        "",
        "aaabaaaab",
        "baaaabaaa",
        "aabaaaxbbaaabcd",
        "abcxxxbaaaabaaaxbbaaabcdamno",
    };
    for (int i = 0; i < (int)(sizeof(pattern) / sizeof(pattern[0])); i++)
    {
        int nPLen = (int)strlen(pattern[i]);
        BadChar(pattern[i], nPLen, nBadArray, 256);
        nFind = SundaySearch(dest, (int)strlen(dest), pattern[i], nPLen, nBadArray);
        if (-1 != nFind)
        {
            printf("Found \"%s\" at %d \t%s\n", pattern[i], nFind, dest + nFind);
        }
        else
        {
            printf("Found \"%s\" no result.\n", pattern[i]);
        }
    }
}

int main(void)
{
    TestSundaySearch();
    return 0;
}
Output:
Found "a" at 0 abcxxxbaaaabaaaxbbaaabcdamno
Found "ab" at 0 abcxxxbaaaabaaaxbbaaabcdamno
Found "abc" at 0 abcxxxbaaaabaaaxbbaaabcdamno
Found "abcd" at 20 abcdamno
Found "x" at 3 xxxbaaaabaaaxbbaaabcdamno
Found "xx" at 3 xxxbaaaabaaaxbbaaabcdamno
Found "xxx" at 3 xxxbaaaabaaaxbbaaabcdamno
Found "ax" at 14 axbbaaabcdamno
Found "axb" at 14 axbbaaabcdamno
Found "xb" at 5 xbaaaabaaaxbbaaabcdamno
Found "b" at 1 bcxxxbaaaabaaaxbbaaabcdamno
Found "m" at 25 mno
Found "mn" at 25 mno
Found "mno" at 25 mno
Found "no" at 26 no
Found "o" at 27 o
Found "" no result.
Found "aaabaaaab" no result.
Found "baaaabaaa" at 6 baaaabaaaxbbaaabcdamno
Found "aabaaaxbbaaabcd" at 9 aabaaaxbbaaabcdamno
Found "abcxxxbaaaabaaaxbbaaabcdamno" at 0 abcxxxbaaaabaaaxbbaaabcdamno