May 5 2004 I first started looking into this, and then did some more research on my own. I want to clarify that what you see here and see below was done without really doing formal research into string matching -- I just had an idea, and wanted to see how far I could go with it. The above linked document contains much more detail, as well as methods for doing string matching, although most of it has already been discovered. I strongly recommend that those of you looking for fuzzy string matching algorithms to look up levenshtein distances. old stuff below:

introduction

Fuzzy Logic is currently an area of interest to me. Right now I am sort of stumblingalong, which in all frankness is part of the fun. Traditional boolean logic statesthat either a fact is true or that it is false. While this sort of logic is fine formany applications, there are times when one may wish to know how true or how falseis a rule.

My current project involves fuzzy pattern matching of strings. The idea is thattraditional string comparisons (such as c/c++'s strcmp) return only a boolean valuetrue or fals as to its equivalence. Given two strings, for example, "cemetery" and"cemetary" -- a common misspelling, strcmp would return false, while the casualobserver can fairly accurately discern the intended spelling.

I hesitate to call it true fuzzy logic, however. I'm just starting to get a gripon the concepts of fuzzy logic, but some of the principles are incorporated withinit.

The applications of this are fairly simple. I assume most word-processors use a spell checker based on fuzzy logic, to find and suggestion possible correct spellings.Compiler parsing and error correction could also implement a similar fuzzy matchingsystem to handle syntax errors. I imagine that this could be incorporated intoa dictionary lookup scheme using Tries.

evaluation methodsThe naive method for comparing strings begins with the first element, and continuesuntil a.) the end of the string is reached, or b.) a difference is encountered.

For a given string of length n, one can perform a boolen match on each element on a one-to-one basis with the second string, returning 1 for a match, and 0 for a mismatch.Returning to the above example of cemetery and cemetary:

string 1cemetery
string 2cemetary

results11111011
summing up the individual results yields:

1 + 1 + 1 + 1 + 1 + 0 + 1 + 1 = 7

divided by n (the length of the string) yields a value of .875 -- 87.5% of the characters match.

This method only works in cases in which there is an error in which only a correct character is replaced by an incorrect one . It fails miserably when a character is omitted from a string:

string 1cemetery
string 2emetary

results00000000

or when two characters are swapped:

string 1cemetery
string 2cemeetry

results11110011 = .75

We can improve on this by comparing characters relative to its current position.In my algorithm, not only do i examine the character at location n, but also atn-1 and n+1 if the first comparison fails. If there is a match, we can't justassign a true value, but some degree of truthfulness. My algorithm gives a 25% penalty on matches made at an offset of +1/-1. This increases the accuracyof the comparison considerably. The above example would yield:

string 1cemetery
string 2cemeetry

results1111.75.7511 = .9375

and a marked increase when a character is ommitted:

string 1cemetery
string 2emetery

results0.7575.75.75.75.75.75 = .65625

Finally, the last method I have implemented is phonetic matching. In the comparisonof "cemetery" and "cemetary" we can easily see that the intended spelling was "cemetery"and that a simple error based on similar sounds in the English language was made. Myalgorithm tries to estimate how close a letter sounds to another. For example,a "c" may sound like a "s", or a "s" may sound like a "z". Ideally these valueswould be derived from statistical analysis of spelling errors, but I, especially atmy low level of expertise in this area, lack the desire to do so. In my algorithmI have assigned a value of .66 for the phonetical match between an A and an E, yielding:

string 1cemetery
string 2cemetary

results11111.6611 = .9575

This method can also be extended to consonant sounds generanted w/ multiple consonants:"ff" for "gh", "ph" for "p", "wr" for "r".

about the script: I am currently working on a Javascript demonstration to demonstratethe performance of these methods, and the combination of them as well, whichwill very shortly be online, probably Saturday May 28th, 1999.

Fuzzy Links:

If you would like a link to your site, drop me a lineand I would be more than happy to do so.

You can reach me via email : jtboehm@kent.edu
Your comments are greatly appreciated =)

last updated : may 28 1999