[KinoSearch] Finding matching search terms
colossus forbin
colossus.forbin at gmail.com
Mon Dec 31 09:21:24 PST 2007
On Dec 30, 2007 10:29 PM, Marvin Humphrey <marvin at rectangular.com> wrote:
> On Sun, Dec 30, 2007 at 07:00:32PM -0800, colossus forbin wrote:
> > I would like to determine if a user has misspelled a term and then
> > suggest a corrected version. If I were dealing with a single term or
> > were AND'ing the terms, it would be much simpler to handle.
>
> The best algorithm for spellchecking -- the one used by Google and lots of
> other people -- actually doesn't use search results as its primary source of
> input. The way to handle this problem is to examine past search histories
> and see how people corrected their queries. If "ciclops" is often followed
> by "cyclops" and the query morphing stops there, then "cyclops" is probably
> the right term and should be suggested.
>
> This really should be implemented as a CPAN project entirely distinct from
> KinoSearch. KS is an inverted indexer at its heart, but an inverted index
> isn't what's needed; the magic is all in the preprocessing, and then it's a
> dictionary lookup (probably via a hash a la Berkeley DB) for each term to
> see if it is associated with any "suggestions".
>
> Though I know of this algo by word of mouth, I'm sure there are many
> academic papers out there by now which could provide a recipe. Then you
> need a large corpus. Those are hard to come by because of privacy concerns
> (think AOL fiasco), but the Pirate Bay search query log might serve:
> <http://thepiratebay.org/tor/3783572>.
>
> Users of this theoretical library might either use a dictionary derived from
> the Pirate Bay logs or might create their own by turning the module loose on
> their own query logs.
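
To make the quoted idea concrete, here is a rough sketch of the preprocessing step: count which query tends to follow which within a session, then keep a correction only when it is seen often enough and is not itself followed by further rewrites. The function name, the session format (a list of queries per user, in order), and the `min_count` threshold are all assumptions for illustration; this is a crude stand-in for the approach described, not a published recipe.

```python
from collections import Counter, defaultdict

def build_suggestions(sessions, min_count=2):
    """Derive a term -> suggestion dictionary from query sessions.

    sessions: iterable of lists of queries, each list in the order one
    user typed them (hypothetical log format).
    """
    followers = defaultdict(Counter)
    for queries in sessions:
        # Count each (query, next query) pair within a session.
        for earlier, later in zip(queries, queries[1:]):
            if earlier != later:
                followers[earlier][later] += 1

    suggestions = {}
    for term, counts in followers.items():
        best, seen = counts.most_common(1)[0]
        # Crude "morphing stops there" test: the candidate correction
        # is never itself followed by a different query.
        if seen >= min_count and best not in followers:
            suggestions[term] = best
    return suggestions

sessions = [
    ["ciclops", "cyclops"],
    ["ciclops", "cyclops"],
    ["ciclops", "ciclop"],
    ["cyclops"],
]
suggestions = build_suggestions(sessions)
```

At query time the result is just a dictionary lookup per term, which is why no inverted index is involved.
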
This approach would make sense for a large site that expects a large
set of search terms, but what about a small site expecting a limited
number of terms, such as a small ecommerce site with a limited number
of products? If a user misspells a product name, it would make sense
not only to offer a corrected spelling but perhaps also to suggest a
similar product that the site carries. These actions would happen at
run time, so it would be important to know which terms did not
contribute to any hits.
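
A minimal sketch of what I have in mind, with a plain Python set standing in for the index's term list (it does not use any KinoSearch API, and the function name and cutoff are made up for illustration). Each query term is checked against the known terms, and any term that is absent gets the closest known terms as suggestions via the standard library's difflib:

```python
import difflib

def suggest_for_missing(query, vocabulary, n=3, cutoff=0.75):
    """Return {term: [close known terms]} for query terms not in vocabulary.

    vocabulary: set of lowercase terms known to the site (a stand-in
    for the terms actually present in the search index).
    """
    known = sorted(vocabulary)
    suggestions = {}
    for term in query.lower().split():
        if term not in vocabulary:
            # This term cannot contribute to any hits; look for
            # similarly spelled terms that can.
            suggestions[term] = difflib.get_close_matches(
                term, known, n=n, cutoff=cutoff
            )
    return suggestions

vocabulary = {"cyclops", "telescope", "tripod"}
result = suggest_for_missing("ciclops tripod", vocabulary)
```

With a vocabulary this small the linear scan inside get_close_matches is cheap enough to run on every query; a large term list would need something smarter.
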
_______________________________________________
KinoSearch mailing list
KinoSearch at rectangular.com
http://www.rectangular.com/mailman/listinfo/kinosearch