String Matching
Formalization:
- Text: $T[1..n]$
- Pattern: $P[1..m]$ ($m\le n$)
- elements of $P$ and $T$ are characters drawn from a finite alphabet $\Sigma$.
- $\Sigma^*$ denotes the set of all finite-length string using $\Sigma$
- $\varepsilon\in \Sigma ^*$ emptry string
- $x y$ denotes concatenation of $x$ and $y$
- $w\sqsubset x$: $w$ is a prefix of $x$, i.e. $x=wy$ for some $y\in \Sigma^*$
- $w\sqsupset x$: $w$ is a suffix of $x$, i.e. $x=yw$ for some $y\in \Sigma^*$
- $P_k$ is the $k$-character prefix $P[q..k]$
Comparison of Algorithms:
Algorithm | Preprocessing time | Matching Time |
---|---|---|
Naive | 0 | $O((n-m+1)m)$ |
Rabin-Karp | $\Theta(m)$ | $O((n-m+1)m)$ |
Finite Automaton | $O(m\vert \Sigma\vert)$ | $\Theta(n)$ |
Knuth-Morris-Pratt | $\Theta(m)$ | $\Theta(n)$ |
Z-Algorithm | 0 | $\Theta(m+n)$ |
Rabin-Karp algorithm
Based on certain assumptions its average-case running time is better.
Let us assume $\Sigma={0,1,2,\ldots,d}$. Let $p$ denote the interpretation of $P[1..m]$ as a radix-d number. Similarly let $t_s$ denote the interpretatin of $T[s+1\ldots s+m]$ for $s\in {0,1,\dots n-m}$. Now $t_s=p\iff T[s+1\ldots s+m]=P[1..m]$.
We can compute $p$ in $\Theta(m)$ using Horner’s rule
$p=P[m]+10(P[m-1]+10(P[m-2]+\ldots))$
Similarly
$t_{s+1}=10(t_s-10^{m-1}T[s+1])+T[s+m+1]$
where we can calculate $10^{m-1}$ in $O(\lg m)$.
For string matching we work with $\mathbb Z_q$ where $q$ is a prime such that $dq$ fits inside a computer word. However $t_{s} \equiv p\pmod q$ doesn’t imply that $t_s=p$, but $t_s\not\equiv p\mod q\implies t_s\neq p$. So we additionaly check explicity string equality.
In worst case it could be $\Theta((n-m+1)m)$ when all shifts are valid.
In average case we expect only $c$ shifts to be valid, then matching time is $O((n-m+1)+cm)=O(n+m)$ plus time required to process spurious hits. We can expect the number of spurious hits to be $O(n/q)$ since $t_s\equiv p\pmod q$ with probability $1/q$. This makes the expected matching time to be $O(n)+O(m(\nu+n/q))$ where $\nu$ is number of valid shifts. This is $O(n)$ if $\nu=O(1)$ (expected number of valid shifts is small) and $q\ge m$ ($q$ larger than the length of pattern)
Finite Automata
Definition
A finite automaton $M(Q, q_0, A, \Sigma, \delta)$ where
- $Q$: states,
- $q_0\in Q$: start state,
- $A\subseteq Q$: accepting states,
- $\Sigma$: input alphabet
- $\delta:Q\times \Sigma\mapsto Q$: transition function
$M$ induces a function $\phi:\Sigma^*\mapsto Q$ (final-state function) such that $\phi(w)$ is the state $M$ ends up in after scanning the string $w$. $M$ accepts a string $w$ iff $\phi(w)\in A$.
$\phi(\varepsilon)=q_0$
$\phi(wa)=\delta(\phi(w), a)\text{ for }w\in\Sigma^*, a\in \Sigma$
String-matching automata
For $P$ we generate a string-matching automaton in a preprocessing step.
For this we define an auxilary function $\sigma:\Sigma^*\mapsto{0,1,\ldots,m}$, the suffix function corresponding to $P$. $\sigma(x)$ is the length of the longest prefix of $P$ that is also a suffix of $x$.
$\sigma(x)=\max{k: P_k \sqsupset x}$
We have $\sigma(x)=m\iff P\sqsupset x$, also $x\sqsubset y\implies \sigma(x)\le \sigma(y)$ from definition.
We define string-matching automaton as:
- $Q={0,1,\ldots,m}$, $q_0=0$, $A={m}$
- $\delta(q, a)=\sigma(P_qa)$ is the longest prefix of pattern $P$ that has matched the text string $T$ so far.
We consider most recently read characters of $T$. For $T_i=P_j$, $P_j\sqsupset T_i$.
Let $q=\phi(T_i)$. We design $\delta$ such that at $q$, $P_q\sqsupset T_i$ and $q=\sigma(T_i)$. Therefore $\phi(T_i)=\sigma(T_i)$.
If the automaton is in state $q$ and reads the next characters $T[i+1]=a$, then we want the transition to lead to the state corresponding to the longest prefix of $P$ that is a suffix of $T_ia$, i.e. $\sigma(T_ia)$.
Because $P_q$ is the longest prefix of $P$ that is a suffix of $T_i$ , i.e. $P_q=\sigma(T_i)$ we also have the longest prefix of $P$ that is a suffix of $T_ia$ is not only $\sigma(T_ia)$ but also $\sigma(P_qa)$.
- If $a=P[q+1]$, $\delta(q, a)=q+1$.
- If $a\neq P[q+1]$ we find a smaller prefix of $P$ that is also a suffix of $T_i$.
Kunth-Morris-Pratt Algorithm
Given a pattern $P[1..m]$, the prefix function for the pattern $P$ is the function $\pi:{1,2,\ldots, m}\mapsto{0,1,\ldots, m-1}$ such that
$\pi[q]=\max{k:k<q\wedge P_k\sqsupset P_q}$
i.e. $\pi[q]$ is the length of the longest prefix of $P$ that is a proper suffix of $P_q$.
For matching:
- Scan the text $T$ from left to right starting with $q=0$ (number of characters matched is $0$ initially):
- While $q>0$ and next character doesn’t match, i.e. $P[q+1]\neq T[i]$, keep doing $q\leftarrow\pi[q]$
- If next character matches, i.e. $P[q+1]=T[i]$, then $q\leftarrow q+1$
- If $q=m$, pattern is found, and set $q\leftarrow\pi[q]$
q = 0 for i = 0 to n - 1: while (k > 0 && T[i] != P[q]): q = pi[q-1] if (T[i] == P[q]): q++ if (q == m): # Pattern Found q = pi[q - 1] pi[q] = k
For computing prefix-function:
-
$\pi[1]\leftarrow0$ by definition
-
Scan the pattern $P$ from left to right starting from $q=2$ 2 (loop starts from 2nd character, 1-indexed) and $k=0$ (number of characters matched in 0 initially):
- While $k>0$ and next character doesn’t match, i.e. $P[k+1]\ne P[q]$, keep doing $k\leftarrow\pi[k]$
- If next character matches, i.e. $P[k+1]=P[q]$, then $k\leftarrow k+1$.
- Set $\pi[q]\leftarrow k$
pi[1] = 0 k = 0 for q = 1 to m - 1: while (k > 0 && P[k] != P[q]): k = pi[k-1] if (P[k] == P[q]): k++ pi[q] = k
Z-Algorithm
The algorithm produces an array $Z$ where $Z[i]$ denotes the length of the longest substring starting from $S[i]$ which is also a prefix of $S$. We refer to prefix-substrings as substrings which are also prefix.
As we iterate over the letters in the string we maintain an interval $[L, R]$ which is the interval with maximum $R$ such that $1\le L\le i\le R$ and $S[L\ldots R]$is a prefix-substring (if no such inteval exists $L=R=-1$)
Now suppose we have correct interval $[L, R]$ for $i-1$ and for all values of $Z$ upto $i-1$, we will compute $Z[i]$ and new $[L, R]$ by following steps:
- $i>R$:
- there is not a prefix-substring of $S$ that starts before $i$ and ends at $\ge i$. We would have taken that interval instead.
- Set $[L,R]$ to $[i,i]$ and start comparing $S[0…]$ to $S[i…]$ and $Z[i]=R-L+1$.
- $i\le R$:
- Let $k=i-L$. We know that $Z[i]\ge \min (Z[k], R-i+1)$
- $Z[k]<R-i+1$:
- $Z[i]=Z[k]$
- $Z[k]\ge R-i+1$:
- Update $[L,R]$ to $[i,R]$ and start comparing $S[R+1…]$ to $S[k+1…]$
L = 0 R = 0 for i = 1 to n: if (i > R): L = i R = i while (R < n && s[R - L] == s[R]): R++ Z[i] = R - L R-- else: k = i - L if (Z[k] < R - i + 1): Z[i] = Z[k] else: L = i while (R < n && s[R-L] == s[R]): R++ Z[i] = R - L R--
For pattern matching we use Z-algorithm on string $P\Phi T$ where $\Phi$ matches nothing Then indices with $Z[i]=m$ correspond to matches in $T$ of $P$.