[KinoSearch] fuzzy searches

Marvin Humphrey marvin at rectangular.com
Mon Mar 15 13:05:53 PDT 2010


On Mon, Mar 15, 2010 at 03:18:12PM +0100, Nick Wellnhofer wrote:
> On 15.03.2010 06:15, Marvin Humphrey wrote:
> > LSI/LSA (Latent Semantic Indexing/Analysis, "LSA" seems to have become more
> > common) fell out of patent a couple of years ago.  The matrix algebra needed
> > to perform the data reduction is heavy-duty math, beyond my capabilities.  But
> > it sure is interesting to think about it in terms of vector space clustering.
> 
> There are also more approaches than LSA. But internally, KinoSearch only
> has to work with the "topic" (or "concept") vectors of each document and
> could support different pluggable models to compute those vectors from
> the term-document matrix.
> 
> If anyone is interested in working on something like that I would gladly
> contribute. 

That would be great.  :)

Lucene has a MoreLikeThisQuery implementation in contrib/, but it produces very
noisy results.  I have an idea for improving it that involves clustering.

It may be necessary to add an indexing component which writes topic vectors,
or it may be possible to achieve this using existing data structures.  It would
be interesting to talk the idea through and find out.
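
To make the "pluggable model" idea concrete, here is a rough Python sketch, not
a proposal for the actual API.  Every name in it (TopicModel, raw_term_model,
more_like_this) is hypothetical and does not exist in KinoSearch or Lucy.  The
stand-in model does no reduction at all: it just emits raw term counts, which
is where an LSA-style model would be swapped in.  Ranking is plain cosine
similarity between the resulting vectors.

```python
from math import sqrt
from typing import Callable, Dict, List

# Hypothetical types for illustration only; not KinoSearch or Lucy APIs.
TermCounts = Dict[str, int]      # one document: term -> frequency
TopicVector = List[float]        # one document's "topic" vector
TopicModel = Callable[[List[TermCounts]], List[TopicVector]]


def raw_term_model(docs: List[TermCounts]) -> List[TopicVector]:
    """Stand-in model: one dimension per term, no data reduction.

    An LSA-style model would replace this function and return
    shorter vectors computed from the term-document matrix.
    """
    vocab = sorted({term for doc in docs for term in doc})
    return [[float(doc.get(term, 0)) for term in vocab] for doc in docs]


def cosine(a: TopicVector, b: TopicVector) -> float:
    """Cosine similarity; defined as 0.0 when either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def more_like_this(doc_id: int, docs: List[TermCounts],
                   model: TopicModel = raw_term_model) -> List[int]:
    """Return the other document ids, most similar to doc_id first."""
    vectors = model(docs)
    target = vectors[doc_id]
    scored = [(cosine(target, vec), i)
              for i, vec in enumerate(vectors) if i != doc_id]
    # Highest similarity first; ties broken by document id.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [i for _, i in scored]


docs = [
    {"fuzzy": 2, "search": 1},
    {"fuzzy": 1, "search": 1},
    {"matrix": 3, "algebra": 1},
]
ranking = more_like_this(0, docs)
```

The point of passing the model in as a function is that the search side only
ever sees vectors, so whether they come from LSA or something else, and whether
they are computed at index time or derived from existing data, stays an open
question.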

We should probably take this to lucy-dev.

    http://lucene.apache.org/lucy/mailing_lists.html#Developers

Marvin Humphrey



