Formalization:

  • Text: (T[1..n])
  • Pattern: (P[1..m]) ((m\le n))
  • elements of (P) and (T) are characters drawn from a finite alphabet (\Sigma).
  • (\Sigma^*) denotes the set of all finite-length string using (\Sigma)
  • (\varepsilon\in \Sigma ^*) emptry string
  • (x y) denotes concatenation of (x) and (y)
  • (w\sqsubset x): (w) is a prefix of (x), i.e. (x=wy) for some (y\in \Sigma^*)
  • (w\sqsupset x): (w) is a suffix of (x), i.e. (x=yw) for some (y\in \Sigma^*)
  • (P_k) is the (k)-character prefix (P[q..k])

Comparison of Algorithms:

Algorithm Preprocessing time Matching Time
Naive 0 (O((n-m+1)m))
Rabin-Karp (\Theta(m)) (O((n-m+1)m))
Finite Automaton (O(m\vert \Sigma\vert)) (\Theta(n))
Knuth-Morris-Pratt (\Theta(m)) (\Theta(n))
Z-Algorithm 0 (\Theta(m+n))

Rabin-Karp algorithm

Based on certain assumptions its average-case running time is better.

Let us assume (\Sigma={0,1,2,\ldots,d}). Let (p) denote the interpretation of (P[1..m]) as a radix-d number. Similarly let (t_s) denote the interpretatin of (T[s+1\ldots s+m]) for (s\in {0,1,\dots n-m}). Now (t_s=p\iff T[s+1\ldots s+m]=P[1..m]).

We can compute (p) in (\Theta(m)) using Horner’s rule

(p=P[m]+10(P[m-1]+10(P[m-2]+\ldots)))

Similarly

(t_{s+1}=10(t_s-10^{m-1}T[s+1])+T[s+m+1])

where we can calculate (10^{m-1}) in (O(\lg m)).

For string matching we work with (\mathbb Z_q) where (q) is a prime such that (dq) fits inside a computer word. However (t_{s} \equiv p\pmod q) doesn’t imply that (t_s=p), but (t_s\not\equiv p\mod q\implies t_s\neq p). So we additionaly check explicity string equality.

In worst case it could be (\Theta((n-m+1)m)) when all shifts are valid.

In average case we expect only (c) shifts to be valid, then matching time is (O((n-m+1)+cm)=O(n+m)) plus time required to process spurious hits. We can expect the number of spurious hits to be (O(n/q)) since (t_s\equiv p\pmod q) with probability (1/q). This makes the expected matching time to be (O(n)+O(m(\nu+n/q))) where (\nu) is number of valid shifts. This is (O(n)) if (\nu=O(1)) (expected number of valid shifts is small) and (q\ge m) ((q) larger than the length of pattern)

Finite Automata

Definition

A finite automaton (M(Q, q_0, A, \Sigma, \delta)) where

  • (Q): states,
  • (q_0\in Q): start state,
  • (A\subseteq Q): accepting states,
  • (\Sigma): input alphabet
  • (\delta:Q\times \Sigma\mapsto Q): transition function

(M) induces a function (\phi:\Sigma^*\mapsto Q) (final-state function) such that (\phi(w)) is the state (M) ends up in after scanning the string (w). (M) accepts a string (w) iff (\phi(w)\in A).

(\phi(\varepsilon)=q_0)

(\phi(wa)=\delta(\phi(w), a)\text{ for }w\in\Sigma^*, a\in \Sigma)

String-matching automata

For (P) we generate a string-matching automaton in a preprocessing step.

For this we define an auxilary function (\sigma:\Sigma^*\mapsto{0,1,\ldots,m}), the suffix function corresponding to (P). (\sigma(x)) is the length of the longest prefix of (P) that is also a suffix of (x).

(\sigma(x)=\max{k: P_k \sqsupset x})

We have (\sigma(x)=m\iff P\sqsupset x), also (x\sqsubset y\implies \sigma(x)\le \sigma(y)) from definition.

We define string-matching automaton as:

  • (Q={0,1,\ldots,m}), (q_0=0), (A={m})
  • (\delta(q, a)=\sigma(P_qa)) is the longest prefix of pattern (P) that has matched the text string (T) so far.

We consider most recently read characters of (T). For (T_i=P_j), (P_j\sqsupset T_i).

Let (q=\phi(T_i)). We design (\delta) such that at (q), (P_q\sqsupset T_i) and (q=\sigma(T_i)). Therefore (\phi(T_i)=\sigma(T_i)).

If the automaton is in state (q) and reads the next characters (T[i+1]=a), then we want the transition to lead to the state corresponding to the longest prefix of (P) that is a suffix of (T_ia), i.e. (\sigma(T_ia)).

Because (P_q) is the longest prefix of (P) that is a suffix of (T_i) , i.e. (P_q=\sigma(T_i)) we also have the longest prefix of (P) that is a suffix of (T_ia) is not only (\sigma(T_ia)) but also (\sigma(P_qa)).

  • If (a=P[q+1]), (\delta(q, a)=q+1).
  • If (a\neq P[q+1]) we find a smaller prefix of (P) that is also a suffix of (T_i).

Kunth-Morris-Pratt Algorithm

Given a pattern (P[1..m]), the prefix function for the pattern (P) is the function (\pi:{1,2,\ldots, m}\mapsto{0,1,\ldots, m-1}) such that

(\pi[q]=\max{k:k<q\wedge P_k\sqsupset P_q})

i.e. (\pi[q]) is the length of the longest prefix of (P) that is a proper suffix of (P_q).

For matching:

  • Scan the text (T) from left to right starting with (q=0) (number of characters matched is (0) initially):
    • While (q>0) and next character doesn’t match, i.e. (P[q+1]\neq T[i]), keep doing (q\leftarrow\pi[q])
    • If next character matches, i.e. (P[q+1]=T[i]), then (q\leftarrow q+1)
    • If (q=m), pattern is found, and set (q\leftarrow\pi[q])
        q = 0
        for i = 0 to n - 1:
        	while (k > 0 && T[i] != P[q]):
        		q = pi[q-1]
        	if (T[i] == P[q]):
        		q++
        	if (q == m):
        		# Pattern Found
        		q = pi[q - 1]
        	pi[q] = k
      

For computing prefix-function:

  • (\pi[1]\leftarrow0) by definition

  • Scan the pattern (P) from left to right starting from (q=2) 2 (loop starts from 2nd character, 1-indexed) and (k=0) (number of characters matched in 0 initially):

    • While (k>0) and next character doesn’t match, i.e. (P[k+1]\ne P[q]), keep doing (k\leftarrow\pi[k])
    • If next character matches, i.e. (P[k+1]=P[q]), then (k\leftarrow k+1).
    • Set (\pi[q]\leftarrow k)
      pi[1] = 0
      k = 0
      for q = 1 to m - 1:
      	while (k > 0 && P[k] != P[q]):
      		k = pi[k-1]
      	if (P[k] == P[q]):
      		k++
      	pi[q] = k
    

Z-Algorithm

The algorithm produces an array (Z) where (Z[i]) denotes the length of the longest substring starting from (S[i]) which is also a prefix of (S). We refer to prefix-substrings as substrings which are also prefix.

As we iterate over the letters in the string we maintain an interval ([L, R]) which is the interval with maximum (R) such that (1\le L\le i\le R) and (S[L\ldots R])is a prefix-substring (if no such inteval exists (L=R=-1))

Now suppose we have correct interval ([L, R]) for (i-1) and for all values of (Z) upto (i-1), we will compute (Z[i]) and new ([L, R]) by following steps:

  • (i>R):
    • there is not a prefix-substring of (S) that starts before (i) and ends at (\ge i). We would have taken that interval instead.
    • Set ([L,R]) to ([i,i]) and start comparing (S[0…]) to (S[i…]) and (Z[i]=R-L+1).
  • (i\le R):
    • Let (k=i-L). We know that (Z[i]\ge \min (Z[k], R-i+1))
    • (Z[k]<R-i+1):
      • (Z[i]=Z[k])
    • (Z[k]\ge R-i+1):
      • Update ([L,R]) to ([i,R]) and start comparing (S[R+1…]) to (S[k+1…])
      L = 0
      R = 0
      for i = 1 to n:
      	if (i > R):
      		L = i
      		R = i
      		while (R < n && s[R - L] == s[R]):
      			R++
      		Z[i] = R - L
      		R--
      	else:
      		k = i - L
      		if (Z[k] < R - i + 1):
      			Z[i] = Z[k]
      		else:
      			L = i
      			while (R < n && s[R-L] == s[R]):
      				R++
      			Z[i] = R - L
      			R--
    

For pattern matching we use Z-algorithm on string (P\Phi T) where (\Phi) matches nothing Then indices with (Z[i]=m) correspond to matches in (T) of (P).