[KinoSearch] Finding matching search terms

Marvin Humphrey marvin at rectangular.com
Sun Dec 30 22:29:46 PST 2007



On Sun, Dec 30, 2007 at 07:00:32PM -0800, colossus forbin wrote:
> I would like to determine if a user has mispelled a term and then
> suggest a corrected version. If I was dealing with a single term or
> was AND'ing the terms it would be much simpler to handle.

The best algorithm for spellchecking -- the one used by Google and lots of
other people -- actually doesn't use search results as its primary source of
input.  The way to handle this problem is to examine past search histories
and see how people corrected their queries.  If "ciclops" is often followed
by "cyclops" and the query morphing stops there, then "cyclops" is probably
the right term and should be suggested.

This really should be implemented as a CPAN project entirely distinct from
KinoSearch.  KS is an inverted indexer at its heart, but an inverted index
isn't what's needed; the magic is all in the preprocessing, and then it's a
dictionary lookup (probably via a hash a la Berkeley DB) for each term to
see if each it is associated with any "suggestions".

Though I know of this algo by word of mouth, I'm sure there are many
academic papers out there by now which could provide a recipe.  Then you
need a large corpus.  Those are hard to come by because of privacy concerns
(think AOL fiasco), but the Pirate Bay search query log might serve:
<http://thepiratebay.org/tor/3783572>.

Users of this theoretical library might either use a dictionary derived from
the Pirate Bay logs or might create their own by turning the module loose on
their own query logs.

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/





More information about the kinosearch mailing list