[KinoSearch] get doc/query similarity
jack_tanner at yahoo.com
Tue Apr 15 21:33:28 PDT 2008
> From: Marvin Humphrey <marvin at rectangular.com>
>
> I've started a reply to this several times, then balled it up and
> ashcanned it. I understand what you want theoretically, and the
> document frequency and term frequency information is in the index and
> accessible at least via private APIs. The question is how to achieve
> whatever your end goal is efficiently and conveniently.
So, you're asking me why exactly I want to go shoot myself in the foot. :)
The setting is NOT a general IR application. I'm working with a very small corpus, and expensive operations are just fine with me.
This is a kind of duplicate-detection task. I have a corpus of documents written by a small, known set of authors. I want to rank the authors by how much they repeat themselves. To do that, I want to take all docs written by the same author, compute their pairwise similarities, and then average those similarities (probably just the mean). I'll repeat this for every author, which gives me a "repetitiveness" score for each one. That score is the actual end goal.
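To make the computation concrete, here is a minimal sketch of the per-author score in plain Python, independent of KinoSearch. Everything in it is an assumption made for illustration: whitespace tokenization, smoothed TF-IDF weights, cosine similarity as the pairwise measure, and the names `tokenize`, `tfidf_vectors`, `cosine`, and `repetitiveness`. Authors with fewer than two docs are given a score of 0.0, which is also just a convention chosen here.

```python
import math
from collections import Counter
from itertools import combinations


def tokenize(text):
    # Assumption: lowercase + whitespace split stands in for a real analyzer.
    return text.lower().split()


def tfidf_vectors(token_lists):
    # Document frequency is counted over the whole corpus, not per author.
    n = len(token_lists)
    df = Counter(t for toks in token_lists for t in set(toks))
    # Smoothed idf so that no term gets a zero or negative weight.
    idf = {t: math.log((1 + n) / (1 + d)) + 1.0 for t, d in df.items()}
    return [{t: c * idf[t] for t, c in Counter(toks).items()}
            for toks in token_lists]


def cosine(a, b):
    # Cosine similarity between two sparse {term: weight} vectors.
    dot = sum(w * b[t] for t, w in a.items() if t in b)
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def repetitiveness(docs_by_author):
    # docs_by_author: {author: [doc_text, ...]} (hypothetical input shape).
    authors, token_lists = [], []
    for author, docs in docs_by_author.items():
        for text in docs:
            authors.append(author)
            token_lists.append(tokenize(text))

    by_author = {author: [] for author in docs_by_author}
    for author, vec in zip(authors, tfidf_vectors(token_lists)):
        by_author[author].append(vec)

    scores = {}
    for author, vecs in by_author.items():
        # All unordered pairs of this author's docs, then the plain mean.
        sims = [cosine(a, b) for a, b in combinations(vecs, 2)]
        scores[author] = sum(sims) / len(sims) if sims else 0.0
    return scores
```

With n docs per author this does n(n-1)/2 comparisons, which is only reasonable because the corpus is very small.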
> The brute force way is to take the contents of a document or possibly
> a distillation of the contents and use that as your query, hand off to
> a Searcher and see what the search gives back. That gives you a bunch
> of docs, though -- not just one. You can constrain the search by
> adding a "primary key"-type requirement, though performance of such a
> search might be a concern with large indexes due to the way KS
> compiles its queries.
I can definitely do that, and then just loop over the hits until I reach the doc of interest. The only problem is when the doc of interest is not retrieved at all, but in that case I can assign it a score of 0.
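The loop over the hits could look roughly like the sketch below. It does not use any KinoSearch API: `hits` is assumed to be whatever iterable of (doc id, score) pairs the search hands back, and `score_of_target` is a hypothetical helper name.

```python
def score_of_target(hits, target_id, default=0.0):
    # hits: iterable of (doc_id, score) pairs from one search (assumed shape).
    for doc_id, score in hits:
        if doc_id == target_id:
            return score
    # The doc of interest was not retrieved at all: fall back to 0.
    return default
```

For example, `score_of_target([("d1", 0.9), ("d2", 0.4)], "d2")` returns `0.4`, and asking for an id that is not among the hits returns `0.0`.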