String Matching Algorithms

See more on GitHub

Rabin-Karp

We can view a string of k characters (digits) as a length-k decimal number. E.g., the string “31425” corresponds to the decimal number 31,425.

  • Given a pattern P [1…m], let p denote the corresponding decimal value.
  • Given a text T [1…n], let $t_s$ denote the decimal value of the length-m substring T [(s+1)…(s+m)] for s=0,1,…,(n-m).
  • Let d be the radix of the numbers, i.e. the number of distinct characters in the text: $d = len(set(T))$.
  • $t_s$ = p iff T [(s+1)…(s+m)] = P [1…m].
  • p can be computed in O(m) time. p = P[m] + d*(P[m-1] + d*(P[m-2]+…)).
  • t0 can similarly be computed in O(m) time.
  • The other values $t_1,\ldots,t_{n-m}$ can be computed in O(n-m) time in total, since $t_{s+1}$ can be computed from $t_s$ in constant time.
    Namely,

$$t_{s+1} = d*(t_s-d^{m-1} * T[s+1])+T[s+m+1]$$

However, there is no need to calculate $t_{s+1}$ exactly. We can use the modulus operation to reduce the work of calculation.
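The exact (non-modular) computation above can be sketched in a few lines. This is only an illustration: the helper names `horner` and `window_values` are not from the original code, indices are 0-based, and the digits are assumed to be already mapped to integers in [0, d).

```python
def horner(digits, d=10):
    """Value of a digit sequence in radix d, most significant digit first."""
    value = 0
    for x in digits:
        value = value * d + x
    return value


def window_values(digits, m, d=10):
    """Exact values t_0, ..., t_{n-m} of all length-m windows, via the rolling update."""
    n = len(digits)
    if m == 0 or n < m:
        return []
    high = d ** (m - 1)           # weight of the leading digit of a window
    t = horner(digits[:m], d)     # t_0, computed in O(m)
    values = [t]
    for s in range(n - m):
        # 0-based: drop digits[s], shift one place, append digits[s + m]
        t = d * (t - high * digits[s]) + digits[s + m]
        values.append(t)
    return values


text = [int(c) for c in "31425"]
print(horner(text))             # 31425
print(window_values(text, 3))   # [314, 142, 425]
```

Each update costs a constant number of arithmetic operations, but the operands grow with m, which is what the modular version avoids.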

We choose a small prime number q, e.g. q = 13 for radix d = 10.
Generally, d*q should fit within one computer word.

We first calculate $t_0 \bmod q$.
Then, for every $t_i\ (i \ge 1)$, assume

$$t_{i-1} = T[i+m-1] + d*T[i+m-2]+\ldots+d^{m-1}*T[i]$$

and denote $d' = d^{m-1} \bmod q$. Thus,

$$
\begin{aligned}
t_i &= (t_{i-1} - d^{m-1}*T[i]) * d + T[i+m]\\
&\equiv (t_{i-1} - d^{m-1}*T[i]) * d + T[i+m] \pmod q\\
&\equiv (t_{i-1}- (d^{m-1} \bmod q) *T[i]) * d + T[i+m] \pmod q\\
&\equiv (t_{i-1}- d'*T[i]) * d + T[i+m] \pmod q
\end{aligned}
$$

So we can compare the modular value of each $t_i$ with that of p.
Only if they are equal do we compare the original characters, namely $T[i+1],T[i+2],\ldots,T[i+m]$, with the pattern, because different values can share the same remainder modulo q.
Generally, the expected running time of this algorithm is O(n+m), and the worst case is O((n-m+1)*m).
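To see why the character-by-character check is still needed, here is a small sketch that records both the shifts whose hash equals the pattern's hash and the shifts that are real matches. The function name `hash_hits` and the sample digit string are chosen for illustration only; the input is assumed to consist of decimal digits, and shifts are 0-based.

```python
def hash_hits(text, pattern, q=13, d=10):
    """Return (hits, matches): shifts with an equal hash, and verified matches."""
    n, m = len(text), len(pattern)
    if m == 0 or n < m:
        return [], []
    T = [int(c) for c in text]
    P = [int(c) for c in pattern]
    high = pow(d, m - 1, q)          # d' = d^(m-1) mod q
    p = t = 0
    for i in range(m):               # p mod q and t_0 mod q
        p = (p * d + P[i]) % q
        t = (t * d + T[i]) % q
    hits, matches = [], []
    for s in range(n - m + 1):
        if t == p:
            hits.append(s)
            if T[s:s + m] == P:      # verify to rule out a spurious hit
                matches.append(s)
        if s < n - m:                # roll the hash to the next window
            t = ((t - high * T[s]) * d + T[s + m]) % q
    return hits, matches


hits, matches = hash_hits("2359023141526739921", "31415")
print(hits)     # [6, 12]
print(matches)  # [6]
```

Shift 12 is a spurious hit: 67399 and 31415 both leave remainder 7 modulo 13, but the strings differ. A larger q makes such collisions rarer.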

Problem: the exact computation assumes that p and $t_s$ are small numbers. They may be too large to work with easily, which is why the values are reduced modulo q.

python implementation

#coding: utf-8
''' mbinary
#########################################################################
# File : rabin_karp.py
# Author: mbinary
# Mail: zhuheqin1@gmail.com
# Blog: https://mbinary.coding.me
# Github: https://github.com/mbinary
# Created Time: 2018-12-11  00:01
# Description: rabin-karp algorithm
#########################################################################
'''

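from random import randint


def isPrime(x):
    '''primality test by trial division'''
    if x < 2:
        return False
    for i in range(2, int(x**0.5) + 1):
        if x % i == 0:
            return False
    return True


def getPrime(x):
    '''return the smallest prime that is not less than x'''
    i = max(x, 2)
    while not isPrime(i):
        i += 1
    return i


def findAll(s, p):
    '''s: string   p: pattern
    return the start indices of all occurrences of p in s'''
    n, m = len(s), len(p)
    if m == 0 or n < m:
        return []
    dic = {}  # map each symbol of s to a digit in [0, d)
    for c in s:
        if c not in dic:
            dic[c] = len(dic)
    d = len(dic)  # radix
    if any(c not in dic for c in p):
        return []  # p uses a symbol that never occurs in s
    q = getPrime(m)
    exp = pow(d, m - 1, q)  # d^(m-1) mod q
    sm = cur = 0
    for i in range(m):  # hash of p and of the first window, both mod q
        sm = (sm * d + dic[p[i]]) % q
        cur = (cur * d + dic[s[i]]) % q
    ret = []
    if cur == sm and p == s[:m]:
        ret.append(0)
    for i in range(m, n):
        # drop s[i-m], shift by one digit, append s[i]
        cur = ((cur - dic[s[i - m]] * exp) * d + dic[s[i]]) % q
        # equal hashes may be a collision, so verify the characters
        if cur == sm and p == s[i - m + 1:i + 1]:
            ret.append(i - m + 1)
    return ret


def randStr(n=3):
    '''return a random sequence of n lowercase character codes'''
    return [randint(ord('a'), ord('z')) for i in range(n)]


if __name__ == '__main__':
    s = randStr(50)
    p = randStr(1)
    print(s)
    print(p)
    print(findAll(s, p))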

FSM

A FSM can be represented as (Q,q0,A,S,C), where

  • Q is the set of all states
  • q0 is the start state
  • $A \subseteq Q$ is the set of accepting states.
  • S is a finite input alphabet.
  • C is the transition function, which maps an input symbol s and a state $q_i$ to the next state: $q_j = c(s,q_i)$.

Given a pattern string P, we can build an FSM for string matching.
Assume P has m characters; then the FSM has m+1 states. One is the start state, and the others record, for each position of P, that the pattern has been matched up to that position.

Once we have built the FSM, we can run it on any input string.
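A minimal sketch of such a matcher follows. It assumes state k means "the last k characters read equal the first k characters of the pattern", builds the transition table directly from that definition (simple, but slower than the best known construction), and uses the hypothetical names `build_fsm` and `fsm_find_all`.

```python
def build_fsm(pattern, alphabet):
    """delta[q][c] = length of the longest prefix of pattern
    that is a suffix of pattern[:q] + c."""
    m = len(pattern)
    delta = []
    for q in range(m + 1):
        row = {}
        for c in alphabet:
            seen = pattern[:q] + c
            k = min(m, q + 1)
            while k > 0 and not seen.endswith(pattern[:k]):
                k -= 1
            row[c] = k
        delta.append(row)
    return delta


def fsm_find_all(text, pattern):
    """Run the automaton over text; report the start index of every match."""
    m = len(pattern)
    if m == 0:
        return []
    delta = build_fsm(pattern, set(text) | set(pattern))
    state, found = 0, []
    for i, c in enumerate(text):
        state = delta[state][c]
        if state == m:               # accepting state: a full match ends at i
            found.append(i - m + 1)
    return found


print(fsm_find_all("abababacaba", "ababaca"))  # [2]
print(fsm_find_all("aaaa", "aa"))              # [0, 1, 2]
```

Once the table is built, the scan reads each text character exactly once, so matching takes O(n) time; the cost is the table itself, which has (m+1) entries per alphabet symbol.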

KMP

Knuth-Morris-Pratt method

The idea is inspired by the FSM. We can avoid computing the transition function. Instead, we compute a prefix function Next on P in O(m) time, and Next has only m entries.

The prefix function stores information about how the pattern matches against shifts of itself.

  • String w is a prefix of string x, if x=wy for some string y
  • String w is a suffix of string x, if x=yw for some string y
  • The k-character prefix of the pattern P [1…m] is denoted by $P_k$.
  • Given that pattern prefix P [1…q] matches text characters T [(s+1)…(s+q)], what is the least shift s’> s such that P [1…k] = T [(s’+1)…(s’+k)] where s’+k=s+q?
  • At the new shift s’, there is no need to compare the first k characters of P with the corresponding characters of T.
    Method: For each prefix $P_q$, find the longest proper prefix of $P_q$ that is also a suffix of $P_q$.
    Next[q] = max{k | k<q and $P_k$ is a suffix of $P_q$}

For example: P = ababaca. For $P_5$ = ababa, Next[5] = 3. Namely, $P_3$ = aba is the longest proper prefix of $P_5$ that is also a suffix of $P_5$.
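The definition of Next can be checked directly with a brute-force sketch. This is deliberately the naive version, written only to mirror the definition; the name `prefix_function_naive` is made up for this example, and the returned list is 0-based, so `result[q-1]` holds Next[q].

```python
def prefix_function_naive(p):
    """Next[q] for q = 1..m, straight from the definition:
    the largest k < q such that p[:k] is a suffix of p[:q]."""
    result = []
    for q in range(1, len(p) + 1):
        k = q - 1
        while k > 0 and p[:k] != p[q - k:q]:
            k -= 1
        result.append(k)
    return result


print(prefix_function_naive("ababaca"))  # [0, 0, 1, 2, 3, 0, 1]
```

The fifth entry is 3, matching Next[5] = 3 in the example above.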

Time complexity: computing the prefix function Next takes O(m), and matching takes O(n), so the whole algorithm takes O(m+n).

python implementation

#coding: utf-8
''' mbinary
#########################################################################
# File : KMP.py
# Author: mbinary
# Mail: zhuheqin1@gmail.com
# Blog: https://mbinary.coding.me
# Github: https://github.com/mbinary
# Created Time: 2018-12-11  14:02
# Description:
#########################################################################
'''

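from random import randint


def getPrefixFunc(s):
    '''return the list of prefix function of s:
    ret[i] is the length of the longest proper prefix of s[:i+1]
    that is also a suffix of s[:i+1]'''
    if len(s) == 0:
        return []
    length = 0  # length of the prefix matched so far
    i = 1
    n = len(s)
    ret = [0]
    while i < n:
        if s[i] == s[length]:
            length += 1
            ret.append(length)
            i += 1
        elif length == 0:
            ret.append(0)
            i += 1
        else:
            # fall back to the next shorter candidate and retry s[i]
            length = ret[length - 1]
    return ret


def findAll(s, p):
    '''return the start indices of all occurrences of p in s'''
    n, m = len(s), len(p)
    if m == 0:
        return []
    pre = getPrefixFunc(p)
    i = j = 0  # positions in s and in p
    ret = []
    while i < n:
        if s[i] == p[j]:
            i += 1
            j += 1
            if j == m:
                ret.append(i - j)
                j = pre[j - 1]  # keep going, allowing overlapping matches
        elif j == 0:
            i += 1
        else:
            j = pre[j - 1]  # shift the pattern without moving i back
    return ret


def randStr(n=3):
    '''return a random sequence of n lowercase character codes'''
    return [randint(ord('a'), ord('z')) for i in range(n)]


if __name__ == '__main__':
    s = randStr(50)
    p = randStr(1)
    print(s)
    print(p)
    print(findAll(s, p))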

Boyer-Moore

  • The longer the pattern is, the faster it tends to work.
  • It compares from the end of the pattern, while KMP compares from the beginning.
  • It works best for character strings (large alphabets), while KMP works best for binary strings (small alphabets).
  • Both KMP and Boyer-Moore work in two phases:
    • Preprocessing the pattern.
    • Searching for the pattern in the input string.
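As a rough illustration of the right-to-left comparison, here is a simplified sketch that uses only the bad-character rule; the full Boyer-Moore algorithm also uses a good-suffix rule, which is omitted here. The function name `bm_bad_char_find_all` is an assumption made for this example.

```python
def bm_bad_char_find_all(text, pattern):
    """Boyer-Moore style search using only the bad-character rule."""
    n, m = len(text), len(pattern)
    if m == 0 or n < m:
        return []
    # preprocessing: rightmost index of each character in the pattern
    last = {c: i for i, c in enumerate(pattern)}
    found = []
    s = 0                                   # current shift of the pattern
    while s <= n - m:
        j = m - 1
        while j >= 0 and pattern[j] == text[s + j]:
            j -= 1                          # compare from the end of the pattern
        if j < 0:
            found.append(s)
            s += 1                          # simplest safe shift after a match
        else:
            # align the mismatched text character with its rightmost
            # occurrence in the pattern, moving forward by at least 1
            s += max(1, j - last.get(text[s + j], -1))
    return found


print(bm_bad_char_find_all("abababacaba", "ababaca"))  # [2]
print(bm_bad_char_find_all("hello world", "o w"))      # [4]
```

When the mismatched text character does not occur in the pattern at all, the shift is j+1, which is why long patterns over large alphabets can skip most of the text.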

Sunday

features

  • simplification of the Boyer-Moore algorithm;
  • uses only the bad-character shift;
  • easy to implement;
  • preprocessing phase in $O(m+\sigma)$ time and $O(\sigma)$ space complexity, where $\sigma$ is the alphabet size;
  • searching phase in O(mn) time complexity;
  • very fast in practice for short patterns and large alphabets.

description

The Quick Search algorithm uses only the bad-character shift table (see the Boyer-Moore algorithm). Here x denotes the pattern, of length m, and y denotes the text. After an attempt where the window is positioned on the text factor y[j … j+m-1], the length of the shift is at least equal to one. So, the character y[j+m] is necessarily involved in the next attempt, and thus can be used for the bad-character shift of the current attempt.

The bad-character shift of the present algorithm is slightly modified to take into account the last character of x as follows: for c in $\Sigma$, qsBc[c] = min{i : 0 < i ≤ m and x[m-i] = c} if c occurs in x, m+1 otherwise (thanks to Darko Brljak).
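The definition of qsBc translates almost literally into code. The sketch below is illustrative: the names `qs_bad_char` and `qs_shift` and the sample pattern are assumptions, and characters missing from the table are given the default shift m+1 at lookup time instead of being stored for the whole alphabet.

```python
def qs_bad_char(x):
    """qsBc[c] = min{i : 0 < i <= m and x[m-i] == c} for every c occurring in x."""
    m = len(x)
    table = {}
    for j, c in enumerate(x):
        # x[j] == x[m-i] with i = m - j; later (more rightward) positions
        # overwrite earlier ones, which leaves the minimum i in the table
        table[c] = m - j
    return table


def qs_shift(table, c, m):
    """Shift for the text character c that sits right after the window."""
    return table.get(c, m + 1)


x = "GCAGAGAG"
table = qs_bad_char(x)
print(sorted(table.items()))          # [('A', 2), ('C', 7), ('G', 1)]
print(qs_shift(table, "T", len(x)))   # 9
```

Storing only the characters of x keeps the table at O(m) entries, at the price of a dictionary lookup with a default value during the search.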

The preprocessing phase is in $O(m+\sigma)$ time and $O(\sigma)$ space complexity.

During the searching phase the comparisons between pattern and text characters during each attempt can be done in any order. The searching phase has a quadratic worst case time complexity but it has a good practical behaviour.

For instance,
[Figure: the shifts chosen by (a) Boyer-Moore, (b) Horspool and (c) Sunday for the same text window]

In this example, t0, …, t4 = a b c a b is the current text window that is compared with the pattern. Its suffix a b has matched, but the comparison c-a causes a mismatch. The bad-character heuristic of the Boyer-Moore algorithm (a) uses the “bad” text character c to determine the shift distance. The Horspool algorithm (b) uses the rightmost character b of the current text window. The Sunday algorithm (c) uses the character directly to the right of the text window, namely d in this example. Since d does not occur in the pattern at all, the pattern can be shifted past this position.

python implementation

''' mbinary
#########################################################################
# File : sunday.py
# Author: mbinary
# Mail: zhuheqin1@gmail.com
# Blog: https://mbinary.coding.me
# Github: https://github.com/mbinary
# Created Time: 2018-07-11  15:26
# Description: string pattern matching, sunday algorithm, an improvement on kmp
#               pattern matching for strings using sunday algorithm
#########################################################################
'''



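from random import randint


def getPos(pattern):
    '''map each char to its distance from the end of the pattern (0 for the last char)'''
    dic = {}
    for i, j in enumerate(pattern[::-1]):
        if j not in dic:
            dic[j] = i
    return dic


def find(s, p):
    '''return the index of the first occurrence of p in s, or -1'''
    dic = getPos(p)
    ps = pp = 0  # current positions in s and in p
    ns = len(s)
    np = len(p)
    while ps < ns and pp < np:
        if s[ps] == p[pp]:
            ps, pp = ps + 1, pp + 1
        else:
            idx = ps + np - pp  # the char right after the current window
            if idx >= ns:
                return -1
            ch = s[idx]
            if ch in dic:
                ps += dic[ch] + 1 - pp  # align the last ch in p with s[idx]
            else:
                ps = idx + 1  # ch does not occur in p: skip past it
            pp = 0
    if pp == np:
        return ps - np
    return -1


def findAll(s, p):
    '''return the start indices of all occurrences of p in s, overlaps included'''
    ret = []
    if len(p) == 0:
        return ret
    i = 0  # offset of the remaining part of s in the original string
    while s:
        tmp = find(s, p)
        if tmp == -1:
            break
        ret.append(i + tmp)
        i += tmp + 1
        s = s[tmp + 1:]  # resume right after the start of this match
    return ret


def randStr(n=3):
    '''return a random sequence of n lowercase character codes'''
    return [randint(ord('a'), ord('z')) for i in range(n)]


def test(n):
    s = randStr(n)
    p = randStr(3)
    str_s = ''.join((chr(i) for i in s))
    str_p = ''.join((chr(i) for i in p))
    n1 = find(s, p)
    n2 = str_s.find(str_p)  # verify against the built-in str.find
    if n1 != n2:
        print(n1, n2, str_p, str_s)
        return False
    return True


if __name__ == '__main__':
    n = 1000
    suc = sum(test(n) for i in range(n))
    print('test {n} times, success {suc} times'.format(n=n, suc=suc))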
