Knuth-Morris-Pratt Algorithm

 

Knuth, Morris and Pratt discovered first linear time string-matching algorithm by following a tight analysis of the naïve algorithm. Knuth-Morris-Pratt algorithm keeps the information that naïve approach wasted gathered during the scan of the text. By avoiding this waste of information, it achieves a running time of O(n + m), which is optimal in the worst case sense. That is, in the worst case Knuth-Morris-Pratt algorithm we have to examine all the characters in the text and pattern at least once.

 

The Failure Function

The KMP algorithm preprocess the pattern P by computing a failure function f that indicates the largest possible shift s using previously performed comparisons. Specifically, the failure function f(j) is defined as the length of the longest prefix of P that is a suffix of P[i . . j].

 

KNUTH-MORRIS-PRATT FAILURE (P)

Input:    Pattern with m characters
Output: Failure function f for P[i . . j]

i 1
j 0
f(0) 0
while i < m do
    if P[j] = P[i]
        f(i) j +1
         i i +1
          jj + 1
else if
         j f(j - 1)
else
        f(i) 0
        i i +1

 

Note that the failure function f for P, which maps j to the length of the longest prefix of P that is a suffix of P[1 . . j], encodes repeated substrings inside the pattern itself.

As an example, consider the pattern P = a b a c a b. The failure function, f(j), using above algorithm is

 

j   a     1      2     3     4     5
P[j]   a      b     a     c     a      b
f(j)   0     0      1    0     1      2

 

By observing the above mapping we can see that the longest prefix of pattern, P, is "a b" which is also a suffix of pattern P.

Consider an attempt to match at position i, that is when the pattern P[0 . . m -1] is aligned with text P[i . . i + m -1].

 

T: a b a c a a b a c c
P: a b a c a b        

 

Assume that the first mismatch occurs between characters T[ i+ j] and P[j] for 0 < j < m. In the above example, the first mismatch is T[5] = a and P[5] = b.

Then, T[i . . i + j -1] = P[0 . . j -1] = u

That is,  T[ 0 . . 4] = P[0 . . 4] = u,  in the example [u = a b a c a] and

T[i + j] ≠ P[j] i.e., T[5] ≠  P[5],  In the example [T[5] = a ≠  b = P[5]].

 

When shifting, it is reasonable to expect that a prefix v of the pattern matches some suffix of the portion u of the text. In our example, u = a b a c a and v = a b a c a, therefore, 'a' a prefix of v matches with 'a' a suffix of u. Let l(j) be the length of the longest string P[0 . . j -1] of pattern that matches with text followed by a character c different from P[j]. Then after a shift, the comparisons can resume between characters T[i + j] and P[l(j)], i.e., T(5) and P(1)

 

T: a b a c a a b a c c
P:         a b a c a b

 

Note that no comparison between T[4] and P[1] needed here.

 


KNUTH-MORRIS-PRATT (T, P)

Input:    Strings T[0 . . n] and P[0 . . m]
Output:
Starting index of substring of T matching
P

f compute failure function of Pattern P
i 0
j 0
while i < length[T] do
    if j m-1 then
        return i- m+1    // we have a match
i i +1
j j +1
else if j > 0
        j f(j -1)
    else
        i i +1

 

Analysis

The running time of Knuth-Morris-Pratt algorithm is proportional to the time needed to read the characters in text and pattern. In other words, the worst-case running time of the algorithm is O(m + n) and it requires O(m) extra space. It is important to note that these quantities are independent of the size of the underlying alphabet.