String Matching
Formalization:
- Text: (T[1..n])
- Pattern: (P[1..m]) ((m\le n))
- elements of (P) and (T) are characters drawn from a finite alphabet (\Sigma).
- (\Sigma^*) denotes the set of all finite-length string using (\Sigma)
- (\varepsilon\in \Sigma ^*) emptry string
- (x y) denotes concatenation of (x) and (y)
- (w\sqsubset x): (w) is a prefix of (x), i.e. (x=wy) for some (y\in \Sigma^*)
- (w\sqsupset x): (w) is a suffix of (x), i.e. (x=yw) for some (y\in \Sigma^*)
- (P_k) is the (k)-character prefix (P[q..k])
Comparison of Algorithms:
| Algorithm | Preprocessing time | Matching Time |
|---|---|---|
| Naive | 0 | (O((n-m+1)m)) |
| Rabin-Karp | (\Theta(m)) | (O((n-m+1)m)) |
| Finite Automaton | (O(m\vert \Sigma\vert)) | (\Theta(n)) |
| Knuth-Morris-Pratt | (\Theta(m)) | (\Theta(n)) |
| Z-Algorithm | 0 | (\Theta(m+n)) |
Rabin-Karp algorithm
Based on certain assumptions its average-case running time is better.
Let us assume (\Sigma={0,1,2,\ldots,d}). Let (p) denote the interpretation of (P[1..m]) as a radix-d number. Similarly let (t_s) denote the interpretatin of (T[s+1\ldots s+m]) for (s\in {0,1,\dots n-m}). Now (t_s=p\iff T[s+1\ldots s+m]=P[1..m]).
We can compute (p) in (\Theta(m)) using Horner’s rule
(p=P[m]+10(P[m-1]+10(P[m-2]+\ldots)))
Similarly
(t_{s+1}=10(t_s-10^{m-1}T[s+1])+T[s+m+1])
where we can calculate (10^{m-1}) in (O(\lg m)).
For string matching we work with (\mathbb Z_q) where (q) is a prime such that (dq) fits inside a computer word. However (t_{s} \equiv p\pmod q) doesn’t imply that (t_s=p), but (t_s\not\equiv p\mod q\implies t_s\neq p). So we additionaly check explicity string equality.
In worst case it could be (\Theta((n-m+1)m)) when all shifts are valid.
In average case we expect only (c) shifts to be valid, then matching time is (O((n-m+1)+cm)=O(n+m)) plus time required to process spurious hits. We can expect the number of spurious hits to be (O(n/q)) since (t_s\equiv p\pmod q) with probability (1/q). This makes the expected matching time to be (O(n)+O(m(\nu+n/q))) where (\nu) is number of valid shifts. This is (O(n)) if (\nu=O(1)) (expected number of valid shifts is small) and (q\ge m) ((q) larger than the length of pattern)
Finite Automata
Definition
A finite automaton (M(Q, q_0, A, \Sigma, \delta)) where
- (Q): states,
- (q_0\in Q): start state,
- (A\subseteq Q): accepting states,
- (\Sigma): input alphabet
- (\delta:Q\times \Sigma\mapsto Q): transition function
(M) induces a function (\phi:\Sigma^*\mapsto Q) (final-state function) such that (\phi(w)) is the state (M) ends up in after scanning the string (w). (M) accepts a string (w) iff (\phi(w)\in A).
(\phi(\varepsilon)=q_0)
(\phi(wa)=\delta(\phi(w), a)\text{ for }w\in\Sigma^*, a\in \Sigma)
String-matching automata
For (P) we generate a string-matching automaton in a preprocessing step.
For this we define an auxilary function (\sigma:\Sigma^*\mapsto{0,1,\ldots,m}), the suffix function corresponding to (P). (\sigma(x)) is the length of the longest prefix of (P) that is also a suffix of (x).
(\sigma(x)=\max{k: P_k \sqsupset x})
We have (\sigma(x)=m\iff P\sqsupset x), also (x\sqsubset y\implies \sigma(x)\le \sigma(y)) from definition.
We define string-matching automaton as:
- (Q={0,1,\ldots,m}), (q_0=0), (A={m})
- (\delta(q, a)=\sigma(P_qa)) is the longest prefix of pattern (P) that has matched the text string (T) so far.
We consider most recently read characters of (T). For (T_i=P_j), (P_j\sqsupset T_i).
Let (q=\phi(T_i)). We design (\delta) such that at (q), (P_q\sqsupset T_i) and (q=\sigma(T_i)). Therefore (\phi(T_i)=\sigma(T_i)).
If the automaton is in state (q) and reads the next characters (T[i+1]=a), then we want the transition to lead to the state corresponding to the longest prefix of (P) that is a suffix of (T_ia), i.e. (\sigma(T_ia)).
Because (P_q) is the longest prefix of (P) that is a suffix of (T_i) , i.e. (P_q=\sigma(T_i)) we also have the longest prefix of (P) that is a suffix of (T_ia) is not only (\sigma(T_ia)) but also (\sigma(P_qa)).
- If (a=P[q+1]), (\delta(q, a)=q+1).
- If (a\neq P[q+1]) we find a smaller prefix of (P) that is also a suffix of (T_i).
Kunth-Morris-Pratt Algorithm
Given a pattern (P[1..m]), the prefix function for the pattern (P) is the function (\pi:{1,2,\ldots, m}\mapsto{0,1,\ldots, m-1}) such that
(\pi[q]=\max{k:k<q\wedge P_k\sqsupset P_q})
i.e. (\pi[q]) is the length of the longest prefix of (P) that is a proper suffix of (P_q).
For matching:
- Scan the text (T) from left to right starting with (q=0) (number of characters matched is (0) initially):
- While (q>0) and next character doesn’t match, i.e. (P[q+1]\neq T[i]), keep doing (q\leftarrow\pi[q])
- If next character matches, i.e. (P[q+1]=T[i]), then (q\leftarrow q+1)
- If (q=m), pattern is found, and set (q\leftarrow\pi[q])
q = 0 for i = 0 to n - 1: while (k > 0 && T[i] != P[q]): q = pi[q-1] if (T[i] == P[q]): q++ if (q == m): # Pattern Found q = pi[q - 1] pi[q] = k
For computing prefix-function:
-
(\pi[1]\leftarrow0) by definition
-
Scan the pattern (P) from left to right starting from (q=2) 2 (loop starts from 2nd character, 1-indexed) and (k=0) (number of characters matched in 0 initially):
- While (k>0) and next character doesn’t match, i.e. (P[k+1]\ne P[q]), keep doing (k\leftarrow\pi[k])
- If next character matches, i.e. (P[k+1]=P[q]), then (k\leftarrow k+1).
- Set (\pi[q]\leftarrow k)
pi[1] = 0 k = 0 for q = 1 to m - 1: while (k > 0 && P[k] != P[q]): k = pi[k-1] if (P[k] == P[q]): k++ pi[q] = k
Z-Algorithm
The algorithm produces an array (Z) where (Z[i]) denotes the length of the longest substring starting from (S[i]) which is also a prefix of (S). We refer to prefix-substrings as substrings which are also prefix.
As we iterate over the letters in the string we maintain an interval ([L, R]) which is the interval with maximum (R) such that (1\le L\le i\le R) and (S[L\ldots R])is a prefix-substring (if no such inteval exists (L=R=-1))
Now suppose we have correct interval ([L, R]) for (i-1) and for all values of (Z) upto (i-1), we will compute (Z[i]) and new ([L, R]) by following steps:
- (i>R):
- there is not a prefix-substring of (S) that starts before (i) and ends at (\ge i). We would have taken that interval instead.
- Set ([L,R]) to ([i,i]) and start comparing (S[0…]) to (S[i…]) and (Z[i]=R-L+1).
- (i\le R):
- Let (k=i-L). We know that (Z[i]\ge \min (Z[k], R-i+1))
- (Z[k]<R-i+1):
- (Z[i]=Z[k])
- (Z[k]\ge R-i+1):
- Update ([L,R]) to ([i,R]) and start comparing (S[R+1…]) to (S[k+1…])
L = 0 R = 0 for i = 1 to n: if (i > R): L = i R = i while (R < n && s[R - L] == s[R]): R++ Z[i] = R - L R-- else: k = i - L if (Z[k] < R - i + 1): Z[i] = Z[k] else: L = i while (R < n && s[R-L] == s[R]): R++ Z[i] = R - L R--
For pattern matching we use Z-algorithm on string (P\Phi T) where (\Phi) matches nothing Then indices with (Z[i]=m) correspond to matches in (T) of (P).