Knuth, Morris and Pratt discovered first linear time string-matching algorithm
by following a tight analysis of the naïve algorithm. Knuth-Morris-Pratt algorithm
keeps the information that naïve approach
wasted gathered during the scan of the text. By avoiding this waste of
information, it achieves a running time of O(*n *+* m*), which is
optimal in the worst case sense. That is, in the worst case Knuth-Morris-Pratt algorithm
we have to examine all the characters in the text and pattern at least once.

The KMP algorithm preprocess the pattern
P by computing a failure
function *f* that indicates the largest possible shift
*s* using
previously performed comparisons. Specifically, the failure function
*f*(*j*)
is defined as the length of the longest prefix of P that is a suffix of
P[*i* . . *j*].

KNUTH-MORRIS-PRATT FAILURE (P)

Input:Pattern with m characters

Output:Failure functionffor P[i. .j]

i← 1j← 0f(0) ← 0

whilei<mdo

if P[j] = P[i]

f(i) ←j+1i←i+1j←j+ 1

else ifj←f(j- 1)

else

f(i) ← 0i←i+1

Note that the failure function *f*
for P, which maps
*j* to
the length of the longest prefix of P that is a suffix of
P[1 . .
*j*], encodes repeated substrings inside the pattern itself.

As an example, consider the pattern P = a b a c a b. The failure
function, *f*(*j*), using above algorithm is

j a 1 2 3 4 5 P[j] a b a c a b f(j) 0 0 1 0 1 2

By observing the above mapping we can see that the longest prefix of pattern, P, is "a b" which is also a suffix of pattern P.

Consider an attempt to match at position *
i*, that is when the pattern
P[0
. . *m *-1] is aligned with text
P[*i* . .* i *+* m *-1].

T: a b a c a a b a c c P: a b a c a b

Assume that the first mismatch occurs between characters T[ i+ j] and P[j] for 0 < j < m. In the above example, the first mismatch is T[5] = a and P[5] = b.

Then, T[

i. .i+j-1] = P[0 . .j-1] =uThat is, T[ 0 . . 4] =

P[0 . . 4] =u, in the example [u= a b a c a] and

T[i+j] ≠P[j] i.e.,T[5] ≠P[5], In the example [T[5] = a ≠ b =P[5]].

When shifting, it is reasonable to expect that a prefix
*v* of the
pattern matches some suffix of the portion *u* of the text. In our example,
*u* = a b a c a and *v* = a b a c a, therefore, 'a' a prefix of
*v*
matches with 'a' a suffix of
*u*. Let *l*(*j*) be the length of
the longest string P[0 . . j -1] of pattern that matches with text followed by a
character c different from
P[j]. Then after a shift, the comparisons can resume
between characters T[i + j] and
P[l(j)], i.e.,
T(5) and P(1)

T: a b a c a a b a c c P: a b a c a b

Note that no comparison between T[4] and P[1] needed here.

KNUTH-MORRIS-PRATT (T, P)

Input:StringsT[0 . .n] andP[0 . .m]Starting index of substring of

Output:TmatchingP

f← compute failure function of PatternPi← 0

j← 0

whilei< length[T] do

ifj←m-1 then

returni-m+1 // we have a matchi←i+1j←j+1

else ifj> 0

j←f(j-1)

else

i←i+1

The running time of Knuth-Morris-Pratt algorithm is proportional to the time
needed to read the characters in text and pattern. In other words, the
worst-case running time of the algorithm is O(*m *+* n*) and it
requires O(*m*) extra space. It is important to note that these
quantities are independent of the size of the underlying alphabet.