String Matching

String matching consists of finding one or more generally, all of the occurrences of a pattern in a text. Finding a certain pattern in a text is a problem arises in text-editing programs and web "surfing". Here we study various fundamental text processing algorithms (CLR) for quickly performing string searching and string matching.

Problem Formulation

Given a text array, T[1 . . n], of n character and a pattern array, P[1 . . m], of m characters. The problem is to find an integer s, called valid shift where 0 Ł s < n - m and T[s +1 . . . s + m] = P[1 . . m]. In other words, to find whether P in T i.e., where P is a substring of T. The elements of P and T are character drawn from some finite alphabet such as {0, 1} or {A, B, . . Z, a, b, . . . , z}.

Given a string T[1 . . n], the substrings is define as T[i . . j] for some 0 Ł i Ł j Ł n -1, that is, the string formed by the characters in T from index i to index j, inclusive. This means that a string is a substring of itself (simply take i = 0 and j = m).

The proper substring, of string T[1 . . n] is T[i . . j] for some 0< i Ł j Ł n -1. That is, we must have either i > 0 or j < m -1.

Using these definition, we can say given any string T[1 . . n], the substrings are

T[i . . j] = T[i] T[i + 1] T[i + 2] . . . T[j] for some 0 Ł i Ł j Ł n -1.

And proper substrings are

T[i . . j] = T[i] T[i + 1] T[i + 2] . . . T[j] for some 0 < i Ł j Ł n -1.

Note that if i > j, then T[i . . j] is equal to the empty string or null, which has length zero.

Using these notations, we can define of a given string T[1 . . n] as T[0 . . i] for some 0 Ł i Ł n -1 and suffix of a given string T[1- n] as T[i . . n -1] for some 0 Ł i Ł n -1.

For example, given string T[1 . . B] is

T: a b c a b a a b c a b a c

then a proper substring of T is

a b a a

a prefix string of T is

a b c

and a suffix string of T is

b a c

Suppose we given string w, x, y and z

w:     a     b     c

x:      a     b     c     a

y:      a     b     c     a     b

z:      a     b     c     a     b     a     a     b

Here

x is a substring of z.

y is a substring of z.

Since |x| Ł |y| therefore, we have

x is a substring of y
ę y: a b c a b

ę x: a b c a

In this case,

Since |y| ≥ |x|

therefore x is a substring of y

y:     a     b     c     a     b

x:     a     b     c     a

In this case, since

|x| = |w|

therefore, x = y

x = a     b     c

w = a     b     c