The Elegant KMP Algorithm: Principles and Implementation

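This article walks through the KMP algorithm: its core idea, how the overlay function is computed, and how it improves on naive string matching, using examples to show how KMP avoids repeated comparisons.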

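Mining large datasets often means processing text, so string matching algorithms are especially important.

For example: find every starting position at which the string "bigger" occurs in the string "bigger than bigger".

There are two matches, at position 0 and position 12. So how do we design an algorithm to find them?

The simplest, most naive idea is to compare position by position. Call the string being searched target and the string being searched for pattern.

First align pattern with the start of target and compare character by character from index 0, until a mismatch occurs or the end of a string is reached; then shift pattern one position to the right and repeat. This is shown in the table below, where red marks a successful match, whose position we record:

The core of this algorithm looks like this: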

// Requires: import java.util.ArrayList; import java.util.List;
// count is assumed to be an int field of the enclosing class; it tallies the work done.
public List<Integer> match(char[] target, char[] pattern)
{
	count = 0;
	
	List<Integer> results = new ArrayList<Integer>();
	
	int target_index = 0;
	int target_temp_index = 0;
	int pattern_index = 0;
	int target_length = target.length;
	int pattern_length = pattern.length;
	
	// Check if the input values are correct
	if ((target_length < pattern_length) || (target_length == 0) || (pattern_length == 0))
		return results;
	
	// Only the (n - m + 1) alignments where pattern still fits inside target are tried
	while (target_index <= target_length - pattern_length)
	{
		target_temp_index = target_index;
		
		// Advance while the characters at the current alignment agree
		while ((target_temp_index < target_length) && (pattern_index < pattern_length) && (target[target_temp_index] == pattern[pattern_index]))
		{
			target_temp_index++;
			pattern_index++;
			
			count++;
		}
		
		if (pattern_index == pattern_length)
		{
			// Found one match of the pattern; record its 0-based start position
			results.add(Integer.valueOf(target_index));
		}
		
		// Shift pattern one position to the right and start over
		target_index++;
		pattern_index = 0;
		
		count++;
	}
	
	return results;
}

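The approach is intuitive and easy to follow, so it needs no further explanation.

Let n be the length of target and m the length of pattern. In the worst case the algorithm takes O((n-m+1) * m) time; for example, matching "aaab" against "aaaaaaaa" takes (8-4+1)*4 = 20 comparisons.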
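To make the worst-case arithmetic concrete, here is a minimal, self-contained sketch of the naive matcher that counts character comparisons. The class name NaiveMatchDemo and the comparisons counter are illustrative assumptions, not part of the original code.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical demo class; names are illustrative and not from the original post.
final class NaiveMatchDemo {
    // Number of character comparisons made by the most recent call to match().
    static int comparisons;

    // Returns every 0-based index at which pattern occurs in target.
    static List<Integer> match(String target, String pattern) {
        comparisons = 0;
        List<Integer> results = new ArrayList<>();
        int n = target.length();
        int m = pattern.length();
        if (m == 0 || n < m) {
            return results;
        }
        // Try each of the n - m + 1 alignments where pattern fits inside target.
        for (int start = 0; start <= n - m; start++) {
            int j = 0;
            while (j < m) {
                comparisons++; // one character comparison
                if (target.charAt(start + j) != pattern.charAt(j)) {
                    break;
                }
                j++;
            }
            if (j == m) {
                results.add(start);
            }
        }
        return results;
    }

    public static void main(String[] args) {
        System.out.println(match("bigger than bigger", "bigger"));
        System.out.println(match("aaaaaaaa", "aaab") + " after " + comparisons + " comparisons");
    }
}
```

For "aaaaaaaa" and "aaab", each of the 5 alignments matches three characters and fails on the fourth, which gives the 20 comparisons computed above.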

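Clearly this is not very efficient: it repeats a lot of work. Can we do better?

Notice that once the first round at index = 0 (the row marked red, row 3 of the figure) has completed, we already know that positions 0 through 5 of target are "bigger". We also know that pattern starts with "b". Since none of the characters at positions 1 through 5 of target is "b", rounds 2 through 6 in the figure above can be skipped entirely, and the next round can start directly at position 6.

Using the structure of the pattern string itself to skip alignments that cannot match is the core idea of the KMP algorithm.

First, a concept that is central to KMP: the overlay function (Overlay_function).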

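For a string a0 a1 ... aj, find the largest k < j such that a0 a1 ... ak = a(j-k) a(j-k+1) ... aj; in other words, the longest proper prefix of the string that is also a suffix of it. If no such k exists, the overlay value is -1.

For example: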
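As one illustration, the sketch below computes the overlay value directly from the definition by brute force. The pattern "abaab" is chosen for this sketch and is not taken from the original figure; the class name OverlayByDefinition is likewise an assumption.

```java
// Hypothetical demo class; it applies the definition literally and is not meant to be efficient.
final class OverlayByDefinition {
    // overlay(j): the largest k < j with pattern[0..k] equal to pattern[j-k..j], or -1 if none exists.
    static int overlay(String pattern, int j) {
        for (int k = j - 1; k >= 0; k--) {
            // Compare the prefix of length k + 1 with the suffix of pattern[0..j] of the same length.
            String prefix = pattern.substring(0, k + 1);
            String suffix = pattern.substring(j - k, j + 1);
            if (prefix.equals(suffix)) {
                return k;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        String pattern = "abaab"; // illustrative pattern
        for (int j = 0; j < pattern.length(); j++) {
            System.out.println(pattern.substring(0, j + 1) + " -> " + overlay(pattern, j));
        }
    }
}
```

For "abaab" the values are -1, -1, 0, 0, 1: the prefix "aba" starts and ends with "a" (k = 0), and "abaab" starts and ends with "ab" (k = 1).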

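The value of k can be computed by recurrence. Suppose the overlay value of the prefix pattern[0..j] is k. Then for pattern[j+1] there are two cases:

1. pattern[k+1] == pattern[j+1]: then overlay(j+1) = k+1 = overlay(j)+1.

2. pattern[k+1] ≠ pattern[j+1]: the overlay can then only be found within the substring made of the first k+1 characters of pattern. Let h = overlay(k). If pattern[h+1] == pattern[j+1], then overlay(j+1) = h+1; otherwise repeat step (2) with h in place of k, until a match is found or no candidate is left, in which case overlay(j+1) = -1.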

public void compute_overlay(char[] pattern)
{
	int pattern_length = pattern.length;
	
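	// overlay[i] stores the overlay value of the prefix pattern[0..i];
	// these values are reused while matching against the target string.
	// (overlay is assumed to be an int[] field of the enclosing class.)
	overlay = new int[pattern_length];
	int k = 0;
	
	// Nothing to compute for an empty pattern
	if (pattern_length == 0)
		return;
	
	// A single character has no proper prefix that is also a suffix, so k = -1
	overlay[0] = -1; 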
	
	for (int i = 1; i<pattern_length; i++)
	{
		// Store the overlay value of previous sub-pattern into k
		k = overlay[i-1];
		
		while ((k >= 0) && (pattern[i] != pattern[k+1]))
		{
			k = overlay[k];
		}
		
		if (pattern[i] == pattern[k+1])
		{
			overlay[i] = k+1;
		}
		else
		{
			// k < 0, means no overlay found
			overlay[i] = -1;
		}
	}
}

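KMP first preprocesses the pattern string, computing the k value for each of its prefixes. Matching then proceeds as in the naive algorithm from the start of this article, except that on a mismatch pattern is not shifted right by 1. If j characters of pattern have already been matched (j ≥ 1), pattern is shifted right by j - (overlay[j-1] + 1) positions, using the 0-based overlay array computed above; equivalently, the position in target stays where it is and comparison resumes at pattern[overlay[j-1] + 1]. The unnecessary comparisons are skipped, which is where the speedup comes from.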
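The matching step described above can be sketched as follows. This is a minimal illustration rather than the original author's implementation: the class name KmpDemo and the String-based signatures are assumptions, and the overlay array uses the same -1-based convention as compute_overlay.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical demo class; a sketch of KMP matching using the overlay convention of this article.
final class KmpDemo {
    // overlay[i] = largest k < i with pattern[0..k] == pattern[i-k..i], or -1 if none exists.
    static int[] computeOverlay(String pattern) {
        int m = pattern.length();
        int[] overlay = new int[m];
        if (m == 0) {
            return overlay;
        }
        overlay[0] = -1;
        for (int i = 1; i < m; i++) {
            int k = overlay[i - 1];
            while (k >= 0 && pattern.charAt(i) != pattern.charAt(k + 1)) {
                k = overlay[k];
            }
            overlay[i] = (pattern.charAt(i) == pattern.charAt(k + 1)) ? k + 1 : -1;
        }
        return overlay;
    }

    // Returns every 0-based index at which pattern occurs in target, including overlapping matches.
    static List<Integer> match(String target, String pattern) {
        List<Integer> results = new ArrayList<>();
        int n = target.length();
        int m = pattern.length();
        if (m == 0 || n < m) {
            return results;
        }
        int[] overlay = computeOverlay(pattern);
        int j = 0; // number of pattern characters matched so far
        for (int i = 0; i < n; i++) {
            // On a mismatch, fall back within pattern; the target index i never moves backwards.
            while (j > 0 && target.charAt(i) != pattern.charAt(j)) {
                j = overlay[j - 1] + 1;
            }
            if (target.charAt(i) == pattern.charAt(j)) {
                j++;
            }
            if (j == m) {
                results.add(i - m + 1);
                j = overlay[m - 1] + 1; // reuse the overlap to continue searching
            }
        }
        return results;
    }

    public static void main(String[] args) {
        System.out.println(match("bigger than bigger", "bigger"));
    }
}
```

Because i only moves forward and each fallback shortens the matched prefix, the target is scanned in a single pass instead of being re-read for every alignment.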

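The idea of deriving the overlay function from the structure of the pattern itself is elegant, and not easy to grasp at first. To finish, we illustrate the KMP process with a figure, which will hopefully make it easier to follow.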