[KinoSearch] get doc/query similarity
jack_tanner at yahoo.com
Fri Apr 18 06:22:34 PDT 2008
> From: Marvin Humphrey <marvin at rectangular.com>
>
> In your case, though, my impression was that you were quite
> knowledgeable, but that your project did not need the devel branch
> badly enough to guarantee sustained momentum over the course of what
> would likely be a drawn-out design discussion.
Oh, I'm always up for a discussion. As to whether I'm knowledgeable enough, no idea. It may just seem that way. :) But I'm happy to try to help.
> Exposing similarity measures would be superficially easy -- all the
> relevant material is in KinoSearch::Search::Similarity. However, the
> actual APIs to interface with the math in Similarity are internal and
> not set up for use the way you described your needs. The bigger
> problems were how to get at "an indexed document", how to list its
> terms, and so on, outside of the context of the existing search API.
Right. How about something like this:
$doc1 = $invindex->get_doc(id_field => 'doc_id', id_value => $id1);
$doc2 = $invindex->get_doc(id_field => 'doc_id', id_value => $id2);
I like that this gets the doc from the invindex rather than a searcher. It makes clear that it returns a doc, not a hit. It either succeeds (we get *the* doc, not any other doc), or fails.
$similarity = $doc1->get_cosine($doc2);
And more generally,
$similarity = $doc1->get_similarity($doc2, $my_similarity_fxn);
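To make concrete what I have in mind, here is a minimal sketch of the arithmetic, written in Python rather than Perl purely for illustration. It assumes each doc can be reduced to a sparse map of term => weight. The names cosine and get_similarity are hypothetical stand-ins for the methods proposed above, not anything that exists in KinoSearch, and the weights are made up.

```python
import math

def cosine(vec_a, vec_b):
    """Cosine of the angle between two sparse term-weight vectors (dicts)."""
    # Only terms present in both documents contribute to the dot product.
    dot = sum(w * vec_b[t] for t, w in vec_a.items() if t in vec_b)
    norm_a = math.sqrt(sum(w * w for w in vec_a.values()))
    norm_b = math.sqrt(sum(w * w for w in vec_b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an empty document is treated as similar to nothing
    return dot / (norm_a * norm_b)

def get_similarity(vec_a, vec_b, similarity_fn=cosine):
    """Apply a caller-supplied similarity function; cosine is the default."""
    return similarity_fn(vec_a, vec_b)

# Made-up term weights standing in for two indexed documents.
doc1 = {"index": 2.0, "search": 1.0}
doc2 = {"search": 1.0, "score": 3.0}
print(get_similarity(doc1, doc2))
```

Unlike a query-based score, a measure computed this way is symmetric and stays within [0, 1] for non-negative weights.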
At indexing time, we probably do this:
$invindexer->spec_field(
    name       => 'doc_id',
    analyzed   => 0,
    vectorized => 0,
    indexed    => 1,
);
On another note, here are my notes from implementing my doc/doc similarity code.
- As you point out, it'd be nice to get back an indexed document. I sidestepped this by recreating a document for each query from scratch. These were Boolean OR queries with lots of clauses.
- KS retrieval is asymmetric (and that's fine). Let similarity(I,A,B) be a function that runs document A as a query against index I, iterates over the hits until it reaches document B, and returns B's score. In general, similarity(I,A,B) != similarity(I,B,A). I handled this by computing both similarity(I,A,B) and similarity(I,B,A) and taking their average.
- One issue that still puzzles me is that KS is apparently capable of a hit score greater than 1! Is that really true?
- Here's a sample output:
Redundancy computation for author: 3600
textID 25 similar to 23 at 3.02506327629089
textID 25 similar to 24 at 3.00168991088867
textID 25 similar to 22 at 2.99010539054871
textID 25 similar to 21 at 1.60162734985352
textID 22 similar to 23 at 2.88369727134705
textID 22 similar to 25 at 2.82533693313599
textID 22 similar to 24 at 2.2787299156189
textID 22 similar to 21 at 1.63472175598145
textID 21 similar to 22 at 2.0984148979187
textID 21 similar to 25 at 1.85871315002441
textID 21 similar to 23 at 1.83884310722351
textID 21 similar to 24 at 1.51606929302216
textID 24 similar to 25 at 3.20844388008118
textID 24 similar to 23 at 2.80741047859192
textID 24 similar to 22 at 2.6320960521698
textID 24 similar to 21 at 1.36477994918823
textID 23 similar to 25 at 2.93027353286743
textID 23 similar to 22 at 2.9176025390625
textID 23 similar to 24 at 2.4804675579071
textID 23 similar to 21 at 1.44625544548035
author 3600 redundancy = 47.3403416872025 / 20 = 2.367017
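
For reference, the averaging and the final redundancy figure can be reproduced from the scores listed above. This is a small sketch in Python, not the code I actually ran. The function names symmetric and redundancy are made up for illustration, and the dictionary simply copies the sample output.

```python
# Directed scores copied from the sample output: (query_id, hit_id) -> score.
directed = {
    (25, 23): 3.02506327629089, (25, 24): 3.00168991088867,
    (25, 22): 2.99010539054871, (25, 21): 1.60162734985352,
    (22, 23): 2.88369727134705, (22, 25): 2.82533693313599,
    (22, 24): 2.2787299156189,  (22, 21): 1.63472175598145,
    (21, 22): 2.0984148979187,  (21, 25): 1.85871315002441,
    (21, 23): 1.83884310722351, (21, 24): 1.51606929302216,
    (24, 25): 3.20844388008118, (24, 23): 2.80741047859192,
    (24, 22): 2.6320960521698,  (24, 21): 1.36477994918823,
    (23, 25): 2.93027353286743, (23, 22): 2.9176025390625,
    (23, 24): 2.4804675579071,  (23, 21): 1.44625544548035,
}

def symmetric(scores, a, b):
    """Average of the two directed scores, so the result ignores order."""
    return (scores[(a, b)] + scores[(b, a)]) / 2.0

def redundancy(scores):
    """Mean of all directed scores for one author."""
    return sum(scores.values()) / len(scores)

print(symmetric(directed, 25, 23))
print(redundancy(directed))
```

Because every pair appears once in each direction, the mean over the 20 directed scores equals the mean of the 10 symmetric averages.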