Formalization:

Text: $T[1..n]$
Pattern: $P[1..m]$ ($m\le n$)
elements of $P$ and $T$ are characters drawn from a finite alphabet $\Sigma$.
$\Sigma^*$ denotes the set of all finite-length string using $\Sigma$
$\varepsilon\in \Sigma ^*$ emptry string
$x y$ denotes concatenation of $x$ and $y$
$w\sqsubset x$: $w$ is a prefix of $x$, i.e. $x=wy$ for some $y\in \Sigma^*$
$w\sqsupset x$: $w$ is a suffix of $x$, i.e. $x=yw$ for some $y\in \Sigma^*$
$P_k$ is the $k$-character prefix $P[q..k]$

Comparison of Algorithms:

Algorithm	Preprocessing time	Matching Time
Naive	0	$O((n-m+1)m)$
Rabin-Karp	$\Theta(m)$	$O((n-m+1)m)$
Finite Automaton	$O(m\vert \Sigma\vert)$	$\Theta(n)$
Knuth-Morris-Pratt	$\Theta(m)$	$\Theta(n)$
Z-Algorithm	0	$\Theta(m+n)$

Rabin-Karp algorithm

Based on certain assumptions its average-case running time is better.

Let us assume $\Sigma={0,1,2,\ldots,d}$. Let $p$ denote the interpretation of $P[1..m]$ as a radix-d number. Similarly let $t_s$ denote the interpretatin of $T[s+1\ldots s+m]$ for $s\in {0,1,\dots n-m}$. Now $t_s=p\iff T[s+1\ldots s+m]=P[1..m]$.

We can compute $p$ in $\Theta(m)$ using Horner’s rule

$p=P[m]+10(P[m-1]+10(P[m-2]+\ldots))$

Similarly

$t_{s+1}=10(t_s-10^{m-1}T[s+1])+T[s+m+1]$

where we can calculate $10^{m-1}$ in $O(\lg m)$.

For string matching we work with $\mathbb Z_q$ where $q$ is a prime such that $dq$ fits inside a computer word. However $t_{s} \equiv p\pmod q$ doesn’t imply that $t_s=p$, but $t_s\not\equiv p\mod q\implies t_s\neq p$. So we additionaly check explicity string equality.

In worst case it could be $\Theta((n-m+1)m)$ when all shifts are valid.

In average case we expect only $c$ shifts to be valid, then matching time is $O((n-m+1)+cm)=O(n+m)$ plus time required to process spurious hits. We can expect the number of spurious hits to be $O(n/q)$ since $t_s\equiv p\pmod q$ with probability $1/q$. This makes the expected matching time to be $O(n)+O(m(\nu+n/q))$ where $\nu$ is number of valid shifts. This is $O(n)$ if $\nu=O(1)$ (expected number of valid shifts is small) and $q\ge m$ ($q$ larger than the length of pattern)

Finite Automata

Definition

A finite automaton $M(Q, q_0, A, \Sigma, \delta)$ where

$Q$: states,
$q_0\in Q$: start state,
$A\subseteq Q$: accepting states,
$\Sigma$: input alphabet
$\delta:Q\times \Sigma\mapsto Q$: transition function

$M$ induces a function $\phi:\Sigma^*\mapsto Q$ (final-state function) such that $\phi(w)$ is the state $M$ ends up in after scanning the string $w$. $M$ accepts a string $w$ iff $\phi(w)\in A$.

$\phi(\varepsilon)=q_0$

$\phi(wa)=\delta(\phi(w), a)\text{ for }w\in\Sigma^*, a\in \Sigma$

String-matching automata

For $P$ we generate a string-matching automaton in a preprocessing step.

For this we define an auxilary function $\sigma:\Sigma^*\mapsto{0,1,\ldots,m}$, the suffix function corresponding to $P$. $\sigma(x)$ is the length of the longest prefix of $P$ that is also a suffix of $x$.

$\sigma(x)=\max{k: P_k \sqsupset x}$

We have $\sigma(x)=m\iff P\sqsupset x$, also $x\sqsubset y\implies \sigma(x)\le \sigma(y)$ from definition.

We define string-matching automaton as:

$Q={0,1,\ldots,m}$, $q_0=0$, $A={m}$
$\delta(q, a)=\sigma(P_qa)$ is the longest prefix of pattern $P$ that has matched the text string $T$ so far.

We consider most recently read characters of $T$. For $T_i=P_j$, $P_j\sqsupset T_i$.

Let $q=\phi(T_i)$. We design $\delta$ such that at $q$, $P_q\sqsupset T_i$ and $q=\sigma(T_i)$. Therefore $\phi(T_i)=\sigma(T_i)$.

If the automaton is in state $q$ and reads the next characters $T[i+1]=a$, then we want the transition to lead to the state corresponding to the longest prefix of $P$ that is a suffix of $T_ia$, i.e. $\sigma(T_ia)$.

Because $P_q$ is the longest prefix of $P$ that is a suffix of $T_i$ , i.e. $P_q=\sigma(T_i)$ we also have the longest prefix of $P$ that is a suffix of $T_ia$ is not only $\sigma(T_ia)$ but also $\sigma(P_qa)$.

If $a=P[q+1]$, $\delta(q, a)=q+1$.
If $a\neq P[q+1]$ we find a smaller prefix of $P$ that is also a suffix of $T_i$.

Kunth-Morris-Pratt Algorithm

Given a pattern $P[1..m]$, the prefix function for the pattern $P$ is the function $\pi:{1,2,\ldots, m}\mapsto{0,1,\ldots, m-1}$ such that

$\pi[q]=\max{k:k<q\wedge P_k\sqsupset P_q}$

i.e. $\pi[q]$ is the length of the longest prefix of $P$ that is a proper suffix of $P_q$.

For matching:

Scan the text $T$ from left to right starting with $q=0$ (number of characters matched is $0$ initially):
- While $q>0$ and next character doesn’t match, i.e. $P[q+1]\neq T[i]$, keep doing $q\leftarrow\pi[q]$
- If next character matches, i.e. $P[q+1]=T[i]$, then $q\leftarrow q+1$
- If $q=m$, pattern is found, and set $q\leftarrow\pi[q]$
```
  q = 0
  for i = 0 to n - 1:
  	while (k > 0 && T[i] != P[q]):
  		q = pi[q-1]
  	if (T[i] == P[q]):
  		q++
  	if (q == m):
  		# Pattern Found
  		q = pi[q - 1]
  	pi[q] = k
```

For computing prefix-function:

$\pi[1]\leftarrow0$ by definition
Scan the pattern $P$ from left to right starting from $q=2$ 2 (loop starts from 2nd character, 1-indexed) and $k=0$ (number of characters matched in 0 initially):
- While $k>0$ and next character doesn’t match, i.e. $P[k+1]\ne P[q]$, keep doing $k\leftarrow\pi[k]$
- If next character matches, i.e. $P[k+1]=P[q]$, then $k\leftarrow k+1$.
- Set $\pi[q]\leftarrow k$
```
  pi[1] = 0
  k = 0
  for q = 1 to m - 1:
  	while (k > 0 && P[k] != P[q]):
  		k = pi[k-1]
  	if (P[k] == P[q]):
  		k++
  	pi[q] = k
```

Z-Algorithm

The algorithm produces an array $Z$ where $Z[i]$ denotes the length of the longest substring starting from $S[i]$ which is also a prefix of $S$. We refer to prefix-substrings as substrings which are also prefix.

As we iterate over the letters in the string we maintain an interval $[L, R]$ which is the interval with maximum $R$ such that $1\le L\le i\le R$ and $S[L\ldots R]$is a prefix-substring (if no such inteval exists $L=R=-1$)

Now suppose we have correct interval $[L, R]$ for $i-1$ and for all values of $Z$ upto $i-1$, we will compute $Z[i]$ and new $[L, R]$ by following steps:

$i>R$:
- there is not a prefix-substring of $S$ that starts before $i$ and ends at $\ge i$. We would have taken that interval instead.
- Set $[L,R]$ to $[i,i]$ and start comparing $S[0…]$ to $S[i…]$ and $Z[i]=R-L+1$.

$i\le R$:

Let $k=i-L$. We know that $Z[i]\ge \min (Z[k], R-i+1)$
$Z[k]<R-i+1$:
- $Z[i]=Z[k]$
$Z[k]\ge R-i+1$:
- Update $[L,R]$ to $[i,R]$ and start comparing $S[R+1…]$ to $S[k+1…]$

  L = 0
  R = 0
  for i = 1 to n:
  	if (i > R):
  		L = i
  		R = i
  		while (R < n && s[R - L] == s[R]):
  			R++
  		Z[i] = R - L
  		R--
  	else:
  		k = i - L
  		if (Z[k] < R - i + 1):
  			Z[i] = Z[k]
  		else:
  			L = i
  			while (R < n && s[R-L] == s[R]):
  				R++
  			Z[i] = R - L
  			R--

For pattern matching we use Z-algorithm on string $P\Phi T$ where $\Phi$ matches nothing Then indices with $Z[i]=m$ correspond to matches in $T$ of $P$.

About	/	Home	/	programming	/	music	/	csl471
notes	/	machine learning	/	data structures	/	algorithms	/	chess
types and programming languages	/