String Matching

 

String matching consists of  finding one or more generally, all of the occurrences of a pattern in a text. Finding a certain pattern in a text is a problem arises in text-editing programs and web "surfing". Here we study various fundamental text processing algorithms (CLR) for quickly performing string searching and string matching.

 

Problem Formulation

Given a text array, T[1 . . n], of n character and a pattern array, P[1 . . m], of m characters. The problem is to find an integer s, called valid shift where 0 £ s < n - m and T[s +1 . . . s + m] = P[1 . . m]. In other words, to find whether P in T i.e., where P is a substring of T. The elements of P and T are character drawn from some finite alphabet such as {0, 1} or {A, B, . . Z, a, b, . . . , z}.

 

Given a string T[1 . . n], the substrings is define as T[i . . j] for some 0 £ i £  j £ n -1, that is, the string formed by the characters in T from index i to index j, inclusive. This means that a string is a substring of itself (simply take i = 0 and j = m).

The proper substring, of string T[1 . . n] is T[i . . j] for some 0< i £  j £ n -1. That is, we must have either i > 0 or j < m -1.

 

Using these definition, we can say given any string T[1 . . n], the substrings are

T[i . . j] = T[i] T[i + 1] T[i + 2] . . . T[j]    for some 0 £ i £  j £ n -1.

And proper substrings are

T[i . . j] = T[i] T[i + 1] T[i + 2] . . . T[j]    for some 0 < i £  j £ n -1.

Note that if i > j, then T[i . . j] is equal to the empty string or null, which has length zero.

 

 

Using these notations, we can define of a given string T[1 . . n] as T[0 . . i] for some 0 £ i  £ n -1 and suffix of a given string T[1- n] as T[i . . n -1] for some  0 £ i  £ n -1.

 

For example, given string T[1 . . B] is

T:     a     b     c     a     b     a     a     b     c     a     b     a     c

then a proper substring of T is

a     b     a     a

a prefix string of T is

a     b     c

and a suffix string of T is

b     a     c

 

Suppose we given string w, x, y and z

w:     a     b     c

x:      a     b     c     a

y:      a     b     c     a     b

z:      a     b     c     a     b     a     a     b

 

Here

x is a substring of z.

y is a substring of z.

Since |x| £ |y| therefore, we have
 

  x is a substring of y

 ê y: a b c a b
 ê x: a b c a

 

In this case,

Since |y| ≥ |x|

therefore x is a substring of y

y:     a     b     c     a     b

x:     a     b     c     a

 

In this case, since

|x| = |w|

therefore, x = y

x  = a     b     c

w = a     b     c