May 5 2004 I first started looking into this, and then did some more research on my own. I want to clarify that what you see here and see below was done without really doing formal research into string matching -- I just had an idea, and wanted to see how far I could go with it. The above linked document contains much more detail, as well as methods for doing string matching, although most of it has already been discovered. I strongly recommend that those of you looking for fuzzy string matching algorithms to look up levenshtein distances. old stuff below:
introduction
My current project involves fuzzy pattern matching of strings. The idea is thattraditional string comparisons (such as c/c++'s strcmp) return only a boolean valuetrue or fals as to its equivalence. Given two strings, for example, "cemetery" and"cemetary" -- a common misspelling, strcmp would return false, while the casualobserver can fairly accurately discern the intended spelling.
I hesitate to call it true fuzzy logic, however. I'm just starting to get a gripon the concepts of fuzzy logic, but some of the principles are incorporated withinit.
The applications of this are fairly simple. I assume most word-processors use a spell checker based on fuzzy logic, to find and suggestion possible correct spellings.Compiler parsing and error correction could also implement a similar fuzzy matchingsystem to handle syntax errors. I imagine that this could be incorporated intoa dictionary lookup scheme using Tries.
For a given string of length n, one can perform a boolen match on each element on a one-to-one basis with the second string, returning 1 for a match, and 0 for a mismatch.Returning to the above example of cemetery and cemetary:
string 1 | c | e | m | e | t | e | r | y | |
string 2 | c | e | m | e | t | a | r | y | |
results | 1 | 1 | 1 | 1 | 1 | 0 | 1 | 1 |
divided by n (the length of the string) yields a value of .875 -- 87.5% of the characters match.
This method only works in cases in which there is an error in which only a correct character is replaced by an incorrect one . It fails miserably when a character is omitted from a string:
string 1 | c | e | m | e | t | e | r | y | |
string 2 | e | m | e | t | a | r | y | ||
results | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
or when two characters are swapped:
string 1 | c | e | m | e | t | e | r | y | ||
string 2 | c | e | m | e | e | t | r | y | ||
results | 1 | 1 | 1 | 1 | 0 | 0 | 1 | 1 | = .75 |
We can improve on this by comparing characters relative to its current position.In my algorithm, not only do i examine the character at location n, but also atn-1 and n+1 if the first comparison fails. If there is a match, we can't justassign a true value, but some degree of truthfulness. My algorithm gives a 25% penalty on matches made at an offset of +1/-1. This increases the accuracyof the comparison considerably. The above example would yield:
string 1 | c | e | m | e | t | e | r | y | ||
string 2 | c | e | m | e | e | t | r | y | ||
results | 1 | 1 | 1 | 1 | .75 | .75 | 1 | 1 | = .9375 |
and a marked increase when a character is ommitted:
string 1 | c | e | m | e | t | e | r | y | ||
string 2 | e | m | e | t | e | r | y | |||
results | 0 | .75 | 75 | .75 | .75 | .75 | .75 | .75 | = .65625 |
Finally, the last method I have implemented is phonetic matching. In the comparisonof "cemetery" and "cemetary" we can easily see that the intended spelling was "cemetery"and that a simple error based on similar sounds in the English language was made. Myalgorithm tries to estimate how close a letter sounds to another. For example,a "c" may sound like a "s", or a "s" may sound like a "z". Ideally these valueswould be derived from statistical analysis of spelling errors, but I, especially atmy low level of expertise in this area, lack the desire to do so. In my algorithmI have assigned a value of .66 for the phonetical match between an A and an E, yielding:
string 1 | c | e | m | e | t | e | r | y | ||
string 2 | c | e | m | e | t | a | r | y | ||
results | 1 | 1 | 1 | 1 | 1 | .66 | 1 | 1 | = .9575 |
This method can also be extended to consonant sounds generanted w/ multiple consonants:"ff" for "gh", "ph" for "p", "wr" for "r".
Fuzzy Links:
If you would like a link to your site, drop me a lineand I would be more than happy to do so.
You can reach me via email : jtboehm@kent.edu
Your comments are greatly appreciated =)