[KinoSearch] get doc/query similarity
Marvin Humphrey
marvin at rectangular.com
Thu Apr 10 15:30:40 PDT 2008
On Apr 10, 2008, at 2:33 PM, jack_tanner at yahoo.com wrote:
> 1) I'd like to compute TF and IDF between a query and one specific
> indexed document. What's the best way to do that?
Hmm, IDF for a *query*, not just a term? A query could be a lot of
different things. To know the IDF, you have to know how many
documents the query matches. To do that for an arbitrary query, you
have to run a search. KinoSearch::Search::Similarity has a private
idf() method, but it works on terms, not arbitrary queries...
Let's assume you mean a term, for the sake of getting things started.
Let's also assume that you don't really mean "one specific document",
even though that's exactly what you said. :)
Here's some code that goes in that general direction: it prints out TF
for each document which matches a specific term. It requires svn
trunk and uses some private methods:
    my $invindex = MySchema->open('/path/to/invindex');
    my $reader   = KinoSearch::Index::IndexReader->open(
        invindex => $invindex,
    );

    # Iterate over every document containing 'foo' in the 'title' field.
    my $posting_list = $reader->posting_list(
        field => 'title',
        term  => 'foo',
    );
    my $sim = $invindex->get_schema->fetch_sim('title');
    while ( my $doc_num = $posting_list->next ) {
        my $doc             = $reader->fetch_doc($doc_num);
        my $posting         = $posting_list->get_posting;
        my $num_occurrences = $posting->get_freq;

        # Convert the raw within-document frequency into a TF weight.
        my $tf = $sim->tf($num_occurrences);
        print "'$doc->{title}' FREQ: $num_occurrences TF: $tf\n";
    }
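To make the numbers concrete, here is a standalone sketch of the arithmetic, with no KinoSearch calls at all. It assumes the Similarity in use follows the classic Lucene-style defaults (TF is the square root of the raw frequency, IDF is 1 + ln(num_docs / (doc_freq + 1))); a custom Similarity subclass may do something different. The function names and the sample counts are made up for illustration.

```perl
use strict;
use warnings;

# ASSUMPTION: classic Lucene-style formulas, not read from KinoSearch.
# Both helper names are hypothetical.

# TF weight: square root of the term's frequency within one document.
sub lucene_style_tf {
    my ($freq) = @_;
    return sqrt($freq);
}

# IDF weight: grows as fewer documents in the collection contain the term.
sub lucene_style_idf {
    my ( $doc_freq, $num_docs ) = @_;
    return 1 + log( $num_docs / ( $doc_freq + 1 ) );
}

# Hypothetical counts: the term occurs 4 times in this document,
# and 9 of the 1000 documents in the index contain it.
my $tf  = lucene_style_tf(4);
my $idf = lucene_style_idf( 9, 1000 );
printf "TF: %.3f IDF: %.3f TF*IDF: %.3f\n", $tf, $idf, $tf * $idf;
```

Note that TF depends only on the one document, while IDF needs collection-wide counts, which is why an arbitrary query would need a full search before an IDF could be computed for it.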
> P.S. FYI, I could not subscribe to this list, post messages, or
> apparently even e-mail marvin at rectangular directly from my
> hotmail account.
Interesting. I received your private email and wrote back. Maybe
hotmail is blocking rectangular.com or something. AOL blockaded me
once because the previous tenants on the Comcast IP block
rectangular.com got assigned to weren't good netizens.
Marvin Humphrey
Rectangular Research
http://www.rectangular.com/