[KinoSearch] get doc/query similarity

Nathan Kurz nate at verse.com
Tue Apr 15 12:19:40 PDT 2008



On Tue, Apr 15, 2008 at 8:13 AM,  <jack_tanner at yahoo.com> wrote:
> Ping? Still trying to compute similarity of two indexed docs... A weighted cosine or some such.

Hi Jack --

As Marvin asked, it would help to know some more details about what
you are hoping to do.  If you don't need exact control of the algorithm,
it sounds like you should be able to generate a long Boolean query
based on the contents of the initial document, and let the default
TF/IDF scorer handle the details.
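
To make the "long Boolean query" idea concrete, here is a rough sketch
in Python (not Perl, and not the KinoSearch API).  Everything in it is
an assumption for illustration: the function names, the TF-IDF
weighting formula, the cutoff of 25 terms, and the idea that your query
parser accepts OR between terms.  It only shows how you might pick the
most distinctive terms from the source document and join them.

```python
import math
from collections import Counter

def top_terms(doc_tokens, doc_freq, num_docs, stop_words=frozenset(), limit=25):
    """Pick the `limit` most distinctive terms of one document.

    doc_tokens: list of tokens from the source document (hypothetical input)
    doc_freq:   dict mapping term -> number of docs containing it
    num_docs:   total number of docs in the corpus
    """
    tf = Counter(t for t in doc_tokens if t not in stop_words)
    weights = {}
    for term, count in tf.items():
        df = doc_freq.get(term, 0)
        if df == 0:
            continue  # term is not in the index, so it cannot match anything
        # One common TF-IDF variant; the real scorer may weight differently.
        weights[term] = count * math.log(1 + num_docs / df)
    # Highest weight first; ties broken alphabetically for stable output.
    return sorted(weights, key=lambda t: (-weights[t], t))[:limit]

def build_or_query(terms):
    """Join terms into one long Boolean OR query string."""
    return " OR ".join(terms)
```

Capping the number of terms is the main knob here: fewer terms means a
cheaper query but a coarser notion of similarity.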

This is going to be a pretty expensive query, though, and depending on
your usage patterns you might want to precompute the results.  Depending
on the overlap of your documents and how heavily you make use of
stop-words, I presume you may have to sift through about half your
corpus, either from disk or memory depending on your situation.

If you need more control over the scoring, Marvin may try to convince
you to use the Index APIs directly.  Don't let him get off this easy!
Having thought about this for a good three minutes :), I'm convinced
this is a good test for the flexibility of the KinoSearch
architecture.  With a couple custom Scorers, and a custom Collector if
you need all combinations, you should be able to make this work very
well.

That said, if speed is a priority, if you need arbitrary matches, and
if you can't precompute the correlations, there might be some fancier
approaches to use.  For example, doing the scoring term by term rather
than doc by doc could save you a lot of function call overhead and be
several times faster.   This might require some more custom pieces,
though.  But I'm not sure which ones --- get Marvin to write that
overview doc of how all the pieces fit together.  ;)

Nathan Kurz
nate at verse.com

ps.  Marvin --- the term-by-term approach might be a useful general
optimization for a special-purpose additive OrScorer.  It's a
speed-memory tradeoff:  instead of computing a final score for each
document and moving on, you allocate an array of scores with an entry
for each doc in your corpus.  For each term occurrence, you add a
partial score to the doc slot in the array.  Because this can be done
in a tight loop, this can be really fast, especially if the score
array can fit in L2 and if you read in the occurrence data non-cached.
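
A rough sketch of the accumulator idea, in Python just to show the
shape of the loop.  The postings layout (term mapped to a list of
doc id and partial score pairs) and both function names are made up
for illustration and are not KinoSearch structures; a real version
would be a tight C loop over packed posting data.

```python
def score_term_at_a_time(postings, query_weights, num_docs):
    """Accumulate partial scores term by term into one array.

    postings:      dict mapping term -> list of (doc_id, partial_score)
    query_weights: dict mapping query term -> weight of that term
    num_docs:      number of docs in the corpus (size of the score array)
    """
    scores = [0.0] * num_docs  # one slot per doc: the memory side of the tradeoff
    for term, q_weight in query_weights.items():
        # Walk this term's whole posting list before touching the next term.
        for doc_id, partial in postings.get(term, ()):
            scores[doc_id] += q_weight * partial
    return scores

def top_k(scores, k):
    """Return up to k (doc_id, score) pairs with nonzero score, best first."""
    hits = [(s, d) for d, s in enumerate(scores) if s > 0.0]
    hits.sort(key=lambda h: (-h[0], h[1]))
    return [(d, s) for s, d in hits[:k]]
```

Note that the ranking only happens once, after every term has been
processed, which is why this only works for purely additive scoring.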

_______________________________________________
KinoSearch mailing list
KinoSearch at rectangular.com
http://www.rectangular.com/mailman/listinfo/kinosearch



