The Knuth-Morris-Pratt Algorithm in my own words

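This article explains, in accessible terms, how the Knuth-Morris-Pratt (KMP) string searching algorithm works, focusing on how to build and use the partial match table to make string matching more efficient.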

For the past few days, I’ve been reading various explanations of the Knuth-Morris-Pratt string searching algorithm. For some reason, none of the explanations were doing it for me. I kept banging my head against a brick wall once I started reading “the prefix of the suffix of the prefix of the…”.

Finally, after reading the same paragraph of CLRS over and over for about 30 minutes, I decided to sit down, do a bunch of examples, and diagram them out. I now understand the algorithm, and can explain it. For those who think like me, here it is in my own words. As a side note, I’m not going to explain why it’s more efficient than naïve string matching; that’s explained perfectly well in a multitude of places. I’m going to explain exactly how it works, as my brain understands it.

The Partial Match Table

The key to KMP, of course, is the partial match table. The main obstacle between me and understanding KMP was the fact that I didn’t quite fully grasp what the values in the partial match table really meant. I will now try to explain them in the simplest words possible.

Here’s the partial match table for the pattern “abababca”:

char:  | a | b | a | b | a | b | c | a |
index: | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 
value: | 0 | 0 | 1 | 2 | 3 | 4 | 0 | 1 |

If I have an eight-character pattern (let’s say “abababca” for the duration of this example), my partial match table will have eight cells. If I’m looking at the eighth and last cell in the table, I’m interested in the entire pattern (“abababca”). If I’m looking at the seventh cell in the table, I’m only interested in the first seven characters in the pattern (“abababc”); the eighth one (“a”) is irrelevant, and can go fall off a building or something. If I’m looking at the sixth cell in the table… you get the idea. Notice that I haven’t talked about what each cell means yet, but just what it’s referring to.

Now, in order to talk about the meaning, we need to know about proper prefixes and proper suffixes.

Proper prefix: All the characters in a string, with one or more cut off the end. “S”, “Sn”, “Sna”, and “Snap” are all the proper prefixes of “Snape”.

Proper suffix: All the characters in a string, with one or more cut off the beginning. “agrid”, “grid”, “rid”, “id”, and “d” are all proper suffixes of “Hagrid”.
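
To make the two definitions concrete, here is a small Python sketch that lists them out. The function names proper_prefixes and proper_suffixes are my own, chosen for illustration; like the examples above, they leave out the empty string.

```python
def proper_prefixes(s):
    """Every prefix of s with one or more characters cut off the end."""
    return [s[:i] for i in range(1, len(s))]


def proper_suffixes(s):
    """Every suffix of s with one or more characters cut off the beginning."""
    return [s[i:] for i in range(1, len(s))]


print(proper_prefixes("Snape"))   # ['S', 'Sn', 'Sna', 'Snap']
print(proper_suffixes("Hagrid"))  # ['agrid', 'grid', 'rid', 'id', 'd']
```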

With this in mind, I can now give the one-sentence meaning of the values in the partial match table:

The length of the longest proper prefix in the (sub)pattern that matches a proper suffix in the same (sub)pattern.

Let’s examine what I mean by that. Say we’re looking in the third cell. As you’ll remember from above, this means we’re only interested in the first three characters (“aba”). In “aba”, there are two proper prefixes (“a” and “ab”) and two proper suffixes (“a” and “ba”). The proper prefix “ab” does not match either of the two proper suffixes. However, the proper prefix “a” matches the proper suffix “a”. Thus, the length of the longest proper prefix that matches a proper suffix, in this case, is 1.

Let’s try it for cell four. Here, we’re interested in the first four characters (“abab”). We have three proper prefixes (“a”, “ab”, and “aba”) and three proper suffixes (“b”, “ab”, and “bab”). This time, “ab” is in both, and is two characters long, so cell four gets value 2.

Just because it’s an interesting example, let’s also try it for cell five, which concerns “ababa”. We have four proper prefixes (“a”, “ab”, “aba”, and “abab”) and four proper suffixes (“a”, “ba”, “aba”, and “baba”). Now, we have two matches: “a” and “aba” are both proper prefixes and proper suffixes. Since “aba” is longer than “a”, it wins, and cell five gets value 3.

Let’s skip ahead to cell seven (the second-to-last cell), which is concerned with the pattern “abababc”. Even without enumerating all the proper prefixes and suffixes, it should be obvious that there aren’t going to be any matches; all the suffixes will end with the letter “c”, and none of the prefixes will. Since there are no matches, cell seven gets 0.

Finally, let’s look at cell eight, which is concerned with the entire pattern (“abababca”). Since the pattern both starts and ends with “a”, we know the value will be at least 1. However, that’s where it ends; at lengths two and up, all the suffixes contain a “c”, while only the last prefix (“abababc”) does. This seven-character prefix does not match the seven-character suffix (“bababca”), so cell eight gets 1.
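
The cell-by-cell reasoning above can be checked mechanically. Below is a deliberately naive Python sketch that builds the table straight from the one-sentence definition by trying every length. The name partial_match_table is my own, and this is not how KMP implementations normally build the table; it is only meant to mirror the hand calculation.

```python
def partial_match_table(pattern):
    """Build the table directly from the definition (slow, but easy to check)."""
    table = []
    for end in range(1, len(pattern) + 1):
        sub = pattern[:end]  # cell number `end` only looks at these characters
        best = 0
        for length in range(1, len(sub)):
            # Compare the proper prefix and the proper suffix of this length.
            if sub[:length] == sub[-length:]:
                best = length  # lengths increase, so the last hit is the longest
        table.append(best)
    return table


print(partial_match_table("abababca"))  # [0, 0, 1, 2, 3, 4, 0, 1]
```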

How to use the Partial Match Table

We can use the values in the partial match table to skip ahead (rather than redoing unnecessary old comparisons) when we find partial matches. The formula works like this:

If a partial match of length partial_match_length is found, we may move the pattern ahead partial_match_length - table[partial_match_length - 1] characters. When that comes out to 1, it is just the ordinary one-character advance, so nothing extra gets skipped.
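
As a quick sanity check of the arithmetic, here is the skip amount written as a tiny Python function. The names TABLE and shift_after_partial_match are mine, used only for this sketch, and the table is the one for “abababca”.

```python
TABLE = [0, 0, 1, 2, 3, 4, 0, 1]  # partial match table for "abababca"


def shift_after_partial_match(table, partial_match_length):
    """How far the pattern may move after matching partial_match_length characters."""
    return partial_match_length - table[partial_match_length - 1]


for length in (1, 3, 5):
    print(length, shift_after_partial_match(TABLE, length))
# 1 1
# 3 2
# 5 2
```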

Let’s say we’re matching the pattern “abababca” against the text “bacbababaabcbab”. Here’s our partial match table again for easy reference:

char:  | a | b | a | b | a | b | c | a |
index: | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 
value: | 0 | 0 | 1 | 2 | 3 | 4 | 0 | 1 |

The first time we get a partial match is here:

bacbababaabcbab
 |
 abababca

This is a partial_match_length of 1. The value at table[partial_match_length - 1] (or table[0]) is 0, so we don’t get to skip ahead any. The next partial match we get is here:

bacbababaabcbab
    |||||
    abababca

This is a partial_match_length of 5. The value at table[partial_match_length - 1] (or table[4]) is 3. That means we get to skip ahead partial_match_length - table[partial_match_length - 1] (or 5 - table[4] or 5 - 3 or 2) characters:

// x denotes a skip

bacbababaabcbab
    xx|||
      abababca

This is a partial_match_length of 3. The value at table[partial_match_length - 1] (or table[2]) is 1. That means we get to skip ahead partial_match_length - table[partial_match_length - 1] (or 3 - table[2] or 3 - 1 or 2) characters:

// x denotes a skip

bacbababaabcbab
      xx|
        abababca

At this point, our pattern is longer than the remaining characters in the text, so we know there’s no match.
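
The whole walkthrough can be replayed in code. The following is a minimal Python sketch under my own naming (kmp_walkthrough, start, and matched are not from any library): it slides the pattern along the text, records each partial match as a (position, partial_match_length) pair, and applies the skip formula after every one. It returns the position of a full match, or -1 if the pattern runs off the end of the text.

```python
def kmp_walkthrough(text, pattern, table):
    """Slide pattern along text, recording (start, partial_match_length) pairs."""
    steps = []
    start = 0    # where the pattern is currently lined up against the text
    matched = 0  # characters already known to match at this alignment
    while start + len(pattern) <= len(text):
        while matched < len(pattern) and text[start + matched] == pattern[matched]:
            matched += 1
        if matched == len(pattern):
            return start, steps  # full match found
        if matched == 0:
            start += 1  # nothing matched: plain one-step advance
        else:
            steps.append((start, matched))
            start += matched - table[matched - 1]  # the skip formula
            matched = table[matched - 1]  # this many characters still line up
    return -1, steps


result, steps = kmp_walkthrough("bacbababaabcbab", "abababca", [0, 0, 1, 2, 3, 4, 0, 1])
print(result)  # -1
print(steps)   # [(1, 1), (4, 5), (6, 3)]
```

The recorded steps line up with the three diagrams above: partial matches of length 1, 5, and 3, after which the pattern no longer fits in the remaining text.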

Conclusion

So there you have it. Like I promised before, it’s no exhaustive explanation or formal proof of KMP; it’s a walk through my brain, with the parts I found confusing spelled out in extreme detail. If you have any questions or notice something I messed up, please leave a comment; maybe we’ll all learn something.

 Dec 13th, 2009

Computing the Partial Match Table for the pattern "ababa"

The Partial Match Table (PMT) is the core component of KMP: it determines how far the pattern slides when a match fails. PMT[i] is the length of the longest proper prefix of the substring pattern[0..i] that is also a proper suffix of it (the whole substring itself does not count). Here is the calculation for the pattern "ababa":

1. i = 0 (substring "a"): there are no proper prefixes or proper suffixes, so PMT[0] = 0.
2. i = 1 (substring "ab"): proper prefixes ["a"], proper suffixes ["b"]. Nothing in common, so PMT[1] = 0.
3. i = 2 (substring "aba"): proper prefixes ["a", "ab"], proper suffixes ["a", "ba"]. The longest match is "a" (length 1), so PMT[2] = 1.
4. i = 3 (substring "abab"): proper prefixes ["a", "ab", "aba"], proper suffixes ["b", "ab", "bab"]. The longest match is "ab" (length 2), so PMT[3] = 2.
5. i = 4 (substring "ababa"): proper prefixes ["a", "ab", "aba", "abab"], proper suffixes ["a", "ba", "aba", "baba"]. The longest match is "aba" (length 3), so PMT[4] = 3.

The PMT for "ababa" is therefore [0, 0, 1, 2, 3].

The same result can be verified with the recurrence-based (efficient) algorithm:

```python
def compute_pmt(pattern):
    pmt = [0] * len(pattern)
    j = 0  # length of the prefix matched so far
    for i in range(1, len(pattern)):
        while j > 0 and pattern[i] != pattern[j]:
            j = pmt[j - 1]  # fall back to the previous, shorter match
        if pattern[i] == pattern[j]:
            j += 1
        pmt[i] = j
    return pmt


print(compute_pmt("ababa"))  # Output: [0, 0, 1, 2, 3]
```

Tracing it for "ababa": at i = 1, 'b' differs from pattern[0] = 'a' and j is already 0, so PMT[1] = 0. At i = 2, 3, and 4 the characters match pattern[0], pattern[1], and pattern[2] in turn, so j grows to 1, 2, and 3.

This algorithm reuses the PMT values it has already computed, in dynamic-programming fashion, and runs in O(m) time, where m is the length of the pattern [^1]. Note that some implementations define a "next" array instead, shifting the whole PMT right by one position and setting next[0] to -1; the values here are the unshifted PMT.

Related questions

1. What role does the PMT play in KMP, and how is it used to speed up string matching?
2. How would you compute the Partial Match Table for the pattern "aabaaab"?
3. How does the time complexity of KMP differ from that of naive string matching?
4. When computing the PMT, why must the matching prefix and suffix be proper (that is, not the entire substring)?

[^1]: Based on the classic implementation of the KMP algorithm; see the string matching chapter of Introduction to Algorithms.