[KinoSearch] Re: Wildcards
Marvin Humphrey
marvin at rectangular.com
Tue Feb 26 20:30:05 PST 2008
On Feb 15, 2008, at 10:25 PM, Father Chrysostomos wrote:
> I’ve written some code that follows approach #1 above, namely, it
> iterates through the posting lists one after the other, keeping a
> list of doc nums that have been seen. It counts them afterwards, to
> get an accurate ‘doc_freq’.
PostingList objects have a get_doc_freq() method, so you can just do
this:
    my $doc_freq = 0;
    $doc_freq += $_->get_doc_freq for @posting_lists;
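One caveat: a document that matches several of the expanded terms is counted once per term in that sum, so the total is an upper bound on the number of distinct documents rather than an exact count. Here is a minimal sketch of the difference. It uses made-up doc numbers in plain arrays (@doc_nums_per_term is hypothetical stand-in data), not real PostingList objects:

```perl
use strict;
use warnings;

# Hypothetical stand-in data: each inner array holds the doc nums matched
# by one expanded term. These are NOT real PostingList objects.
my @doc_nums_per_term = ( [ 1, 4, 7 ], [ 4, 9 ], [ 7, 9, 12 ] );

# Summing per-term counts, analogous to adding up get_doc_freq() values.
my $summed = 0;
$summed += scalar @$_ for @doc_nums_per_term;

# Counting each doc num only once, however many terms it matched.
my %seen;
$seen{$_} = 1 for map { @$_ } @doc_nums_per_term;
my $distinct = scalar keys %seen;

print "summed=$summed distinct=$distinct\n";
```

Whether the upper bound is good enough depends on what the doc_freq is used for. For a rough weighting factor it may well be.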
> Is this something you would be willing to include in core, so I
> don’t have to repeat it in multiple subclasses?
That approach, which is the one used in
KinoSearch::Docs::Cookbook::WildCardQuery, really isn't a very good
option -- it just makes for the simplest and shortest code sample.
First, you lose all the information other than document numbers. When
iterating over a PostingList, you'd typically want to access info like
the number of times the term appears in the document. Here's a crude,
unweighted scoring technique using TF:
    my @hits;
    while ( $posting_list->next ) {
        my $posting = $posting_list->get_posting;
        push @hits, {
            doc_num => $posting->get_doc_num,
            freq    => $posting->get_freq,
        };
    }
    @hits = sort { $b->{freq} <=> $a->{freq} } @hits;
That snippet uses Perl APIs which aren't public yet, but will be
eventually. It would be short-sighted to put an API into core that
yields only doc numbers and discards other info.
The more flexible option would be to provide a CompositePostingList
class which interleaves the iterators of disparate PostingList objects
using a priority queue. ORScorer more or less works this way; it adds
up scores in a way that's analogous to how a CompositePostingList
object would add up term freq.
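To make the interleaving idea concrete, here is a rough sketch in plain Perl. It is not KinoSearch code: make_iter and merge_postings are hypothetical names, each "posting list" is just a closure over [ doc_num, freq ] pairs in strictly ascending doc_num order, and the priority queue is a naive one (an array re-sorted on each pass) rather than a real heap.

```perl
use strict;
use warnings;

# Hypothetical stand-in for a PostingList: a closure that yields
# [ doc_num, freq ] pairs in ascending doc_num order, then undef.
sub make_iter {
    my @postings = @_;
    return sub { shift @postings };
}

# Interleave several iterators into one list of [ doc_num, summed_freq ].
sub merge_postings {
    my @iters = @_;

    # Queue entries are [ current_posting, iterator ].
    my @queue;
    for my $iter (@iters) {
        my $posting = $iter->();
        push @queue, [ $posting, $iter ] if defined $posting;
    }

    my @out;
    while (@queue) {
        # Naive priority queue: re-sort so the lowest doc num comes first.
        @queue = sort { $a->[0][0] <=> $b->[0][0] } @queue;
        my $doc_num = $queue[0][0][0];
        my $freq    = 0;

        # Drain every iterator currently positioned on this doc num,
        # adding up term freq and advancing each one.
        while ( @queue and $queue[0][0][0] == $doc_num ) {
            my $entry = shift @queue;
            $freq += $entry->[0][1];
            my $next = $entry->[1]->();
            push @queue, [ $next, $entry->[1] ] if defined $next;
        }
        push @out, [ $doc_num, $freq ];
    }
    return @out;
}

my @merged = merge_postings(
    make_iter( [ 1, 2 ], [ 4, 1 ], [ 7, 3 ] ),
    make_iter( [ 4, 2 ], [ 9, 1 ] ),
    make_iter( [ 7, 1 ], [ 9, 4 ], [ 12, 1 ] ),
);
my $summary = join ' ', map { "$_->[0]:$_->[1]" } @merged;
print "$summary\n";
```

Each doc num comes out once, with the freqs from every matching term added together. A real class would also need to handle skipping and expose the rest of the posting data, and would want a proper heap instead of the repeated sort.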
If you snoop the code for ORScorer (and its support class
ScorerDocQueue), though, you'll see that it's a little involved. It
would be overkill to add a large, complex CompositePostingList class
to KS right now, in order to avoid short-term code duplication.
Marvin Humphrey
Rectangular Research
http://www.rectangular.com/