Return a list of the best ``good enough'' matches. word is a
sequence for which close matches are desired (typically a string),
and possibilities is a list of sequences against which to
match word (typically a list of strings).
Optional argument n (default 3) is the maximum number
of close matches to return; n must be greater than 0.
Optional argument cutoff (default 0.6) is a float in
the range [0, 1]. Possibilities that don't score at least that
similar to word are ignored.
The best (no more than n) matches among the possibilities are
returned in a list, sorted by similarity score, most similar first.
This is a flexible class for comparing pairs of sequences of any
type, so long as the sequence elements are hashable. The basic
algorithm predates, and is a little fancier than, an algorithm
published in the late 1980's by Ratcliff and Obershelp under the
hyperbolic name ``gestalt pattern matching.'' The idea is to find
the longest contiguous matching subsequence that contains no
``junk'' elements (the Ratcliff and Obershelp algorithm doesn't
address junk). The same idea is then applied recursively to the
pieces of the sequences to the left and to the right of the matching
subsequence. This does not yield minimal edit sequences, but does
tend to yield matches that ``look right'' to people.
Timing: The basic Ratcliff-Obershelp algorithm is cubic
time in the worst case and quadratic time in the expected case.
SequenceMatcher is quadratic time for the worst case and has
expected-case behavior dependent in a complicated way on how many
elements the sequences have in common; best case time is linear.
See Also:
Pattern Matching: The Gestalt Approach
Discussion of a
similar algorithm by John W. Ratcliff and D. E. Metzener.
This was published in
Dr. Dobb's Journal in
July, 1988.