Boyer-Moore Algorithm

The Boyer-Moore algorithm is considered the most efficient string-matching algorithm in typical applications, for example in text editors and command substitution. The reason is that it works fastest when the alphabet is moderately sized and the pattern is relatively long.

The algorithm scans the characters of the pattern from right to left beginning with the rightmost character. During the testing of a possible placement of pattern P against text T, a mismatch of text character T[i] = c with the corresponding pattern character P[j] is handled as follows: If c is not contained anywhere in P, then shift the pattern P completely past T[i]. Otherwise, shift P until an occurrence of character c in P gets aligned with T[i].

This technique tends to avoid many needless comparisons by shifting the pattern a significant distance relative to the text.
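The two mismatch cases can be sketched in Python as follows. This is only an illustration of the shift rule, not a full matcher. The helper name pattern_shift is hypothetical, and the sketch assumes that the occurrence of c to be aligned with T[i] is the last (rightmost) one in P, with a fallback shift of one position when that occurrence lies to the right of the mismatch position j.

```python
def pattern_shift(pattern: str, j: int, c: str) -> int:
    """Return how many positions to slide the pattern after a mismatch.

    j is the pattern index where the mismatch happened and c is the text
    character that was aligned with pattern[j].
    """
    # str.rfind gives the index of the rightmost occurrence, or -1 if absent.
    rightmost = pattern.rfind(c)
    # Absent character: rightmost is -1, so the pattern moves j + 1 positions,
    # i.e. completely past the mismatching text character.
    # Occurrence to the right of j: aligning it would move the pattern
    # backwards, so fall back to a shift of one position.
    return max(1, j - rightmost)


if __name__ == "__main__":
    print(pattern_shift("abacab", 5, "d"))  # 6: 'd' is not in the pattern
    print(pattern_shift("abacab", 5, "c"))  # 2: aligns the 'c' at index 3
    print(pattern_shift("abacab", 2, "c"))  # 1: the only 'c' is right of j
```

The larger the returned value, the more text characters are skipped without ever being compared.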

Last Function

We define a function last(c) that takes a character c from the alphabet and specifies how far we may shift the pattern P when a character equal to c is found in the text and does not match the pattern:

last(c) = index of the last occurrence of c in pattern P,   if c is in P

last(c) = -1,                                               otherwise

For example, consider the following text T and pattern P, shown with their character indices:

index: 0 1 2 3 4 5 6 7 8 9

T:     a b a c a a b a c c

index: 0 1 2 3 4 5

P:     a b a c a b

last(a) is the index of the last (rightmost) occurrence of 'a' in P, which is 4.

last(c) is the index of the last occurrence of 'c' in P, which is 3.

'd' does not occur in the pattern, therefore last(d) = -1.

So far, the table looks like this (the entry for 'b' is determined next):

c          a    b    c    d

last(c)    4    ?    3   -1

Now, for 'b', notice that it occurs twice in P, at indices 1 and 5:

T: a b a c a a b a c c

P: a b a c a b

Therefore, last(b) is the index of the last occurrence of 'b' in P, which is 5.

The complete last(c) function:

c          a    b    c    d

last(c)    4    5    3   -1
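The table above can be reproduced with a few lines of Python. This is a minimal sketch: the function name build_last and the choice of a dictionary keyed by character are assumptions made for illustration, and the alphabet {a, b, c, d} is taken from the example.

```python
def build_last(pattern: str, alphabet: str) -> dict:
    """Map every alphabet character to its last index in pattern, or -1."""
    # Start with -1 for every character of the alphabet: O(|alphabet|).
    last = {c: -1 for c in alphabet}
    # Scan the pattern left to right; later occurrences overwrite earlier
    # ones, so each entry ends up holding the rightmost index: O(m).
    for index, c in enumerate(pattern):
        last[c] = index
    return last


if __name__ == "__main__":
    print(build_last("abacab", "abcd"))
    # {'a': 4, 'b': 5, 'c': 3, 'd': -1}
```

One pass over the alphabet plus one pass over the pattern is enough, which is where a preprocessing cost proportional to the alphabet size plus the pattern length comes from.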

Boyer-Moore algorithm

BOYER_MOORE_MATCHER (T, P)

Input:    Text with n characters and Pattern with m characters
Output: Index of the first substring of T matching P

Compute function last
i ← m - 1
j ← m - 1
repeat
    if P[j] = T[i] then
        if j = 0 then
            return i        // we have a match
        else
            i ← i - 1
            j ← j - 1
    else
        i ← i + m - min(j, 1 + last[T[i]])
        j ← m - 1
until i > n - 1
return "no match"
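The pseudocode above could be rendered in Python roughly as follows. This is a sketch under a few assumptions that are not in the original: the function name boyer_moore_match is hypothetical, -1 stands in for "no match", the empty pattern is treated as matching at index 0, and the loop tests its condition before the first comparison so that a pattern longer than the text is handled safely. The longer text in the usage example is also made up for illustration.

```python
def boyer_moore_match(text: str, pattern: str) -> int:
    """Return the index of the first occurrence of pattern in text, or -1."""
    n, m = len(text), len(pattern)
    if m == 0:
        return 0  # assumption: an empty pattern matches at index 0
    # last[c] = index of the rightmost occurrence of c in the pattern;
    # characters that do not occur are looked up as -1 below.
    last = {c: k for k, c in enumerate(pattern)}
    i = j = m - 1
    while i <= n - 1:
        if pattern[j] == text[i]:
            if j == 0:
                return i  # the whole pattern matched, starting at i
            i -= 1
            j -= 1
        else:
            # Jump step from the pseudocode, then restart at the pattern's end.
            i += m - min(j, 1 + last.get(text[i], -1))
            j = m - 1
    return -1


if __name__ == "__main__":
    print(boyer_moore_match("abacaabacc", "abacab"))            # -1
    print(boyer_moore_match("abacaabadcabacabaabb", "abacab"))  # 10
```

The ten-character text from the example above does not contain the pattern, so the first call reports no match; the second call uses a longer, hypothetical text in which the pattern starts at index 10.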

Analysis

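The computation of the last function takes O(m + |Σ|) time and the actual search takes O(nm) time in the worst case. Therefore the worst-case running time of the Boyer-Moore algorithm is O(nm + |Σ|). This means the worst-case running time is quadratic when m is proportional to n (for example, m = n/2), the same as the naïve algorithm. Such a worst case occurs, for instance, when T consists only of the character a and P is b followed by m-1 copies of a: every placement costs m comparisons and shifts the pattern by only one position.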

The Boyer-Moore algorithm is extremely fast on a large alphabet (relative to the length of the pattern).
The payoff is not as great for binary strings or for very short patterns.
For binary strings, the Knuth-Morris-Pratt algorithm is recommended.
For the very shortest patterns, the naïve algorithm may be better.
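To make the gap between the worst case and the favourable case concrete, the following sketch counts character comparisons. The function name count_comparisons and both inputs are hypothetical, and the sketch assumes a non-empty pattern.

```python
def count_comparisons(text: str, pattern: str) -> int:
    """Run the Boyer-Moore loop and return the number of character comparisons.

    Assumes a non-empty pattern.
    """
    n, m = len(text), len(pattern)
    last = {c: k for k, c in enumerate(pattern)}
    comparisons = 0
    i = j = m - 1
    while i <= n - 1:
        comparisons += 1
        if pattern[j] == text[i]:
            if j == 0:
                break  # match found, stop counting
            i -= 1
            j -= 1
        else:
            i += m - min(j, 1 + last.get(text[i], -1))
            j = m - 1
    return comparisons


if __name__ == "__main__":
    n, m = 12, 4
    # Worst case: every placement matches m - 1 characters, then fails.
    print(count_comparisons("a" * n, "b" + "a" * (m - 1)))  # 36 = m * (n - m + 1)
    # Favourable case: the text character never occurs in the pattern.
    print(count_comparisons("d" * n, "abcb"))               # 3 = n / m
```

In the first call each of the n - m + 1 placements costs m comparisons; in the second, every mismatch lets the pattern jump a full m positions.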
