[KinoSearch] Re: Wildcards

Marvin Humphrey marvin at rectangular.com
Tue Feb 26 20:30:05 PST 2008


On Feb 15, 2008, at 10:25 PM, Father Chrysostomos wrote:

> I’ve written some code that follows approach #1 above, namely, it  
> iterates through the posting lists one after the other, keeping a  
> list of doc nums that have been seen. It counts them afterwards, to  
> get an accurate ‘doc_freq’.

PostingList objects have a get_doc_freq() method, so you can just do  
this:

   my $doc_freq = 0;
   $doc_freq += $_->get_doc_freq for @posting_lists;

> Is this something you would be willing to include in core, so I  
> don’t have to repeat it in multiple subclasses?


That approach, which is the one used in  
KinoSearch::Docs::Cookbook::WildCardQuery, really isn't a very good  
option -- it just makes for the simplest and shortest code sample.

First, you lose all the information other than document numbers.  When  
iterating over a PostingList, you'd typically want to access info like  
the number of times the term appears in the document.  Here's a crude,  
unweighted scoring technique using TF:

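   # Record the doc number and within-doc term frequency of each posting.
   my @hits;
   while ( $posting_list->next ) {
      my $posting = $posting_list->get_posting;
      push @hits, {
         doc_num => $posting->get_doc_num,
         freq    => $posting->get_freq,
      };
   }
   # Rank by raw term frequency, highest first.
   @hits = sort { $b->{freq} <=> $a->{freq} } @hits;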

That snippet uses Perl APIs which aren't public yet, but will be  
eventually.  It would be short-sighted to put an API into core that  
yields only doc numbers and discards other info.

The more flexible option would be to provide a CompositePostingList  
class which interleaves the iterators of disparate PostingList objects  
using a priority queue.  ORScorer more or less works this way; it adds  
up scores in a way that's analogous to how a CompositePostingList  
object would add up term freq.
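
To make the interleaving idea concrete, here is a rough sketch that uses
no KinoSearch APIs at all. Closures over arrays of [doc_num, freq] pairs
stand in for PostingList objects, and make_iter() and merge_postings()
are hypothetical names invented for this illustration. A re-sorted array
stands in for the priority queue to keep the sample short; a real
implementation would want a proper heap.

```perl
use strict;
use warnings;

# Hypothetical stand-in for a PostingList: an iterator that returns
# [doc_num, freq] pairs in ascending doc_num order, then undef.
sub make_iter {
    my @postings = @_;
    my $i = 0;
    return sub { $i < @postings ? $postings[ $i++ ] : undef };
}

# Interleave several iterators in doc_num order, adding up freq when
# the same document turns up in more than one of them.
sub merge_postings {
    my @iters = @_;

    # Each queue entry is [ current posting, iterator it came from ].
    my @queue;
    for my $iter (@iters) {
        my $posting = $iter->();
        push @queue, [ $posting, $iter ] if defined $posting;
    }

    my @merged;
    while (@queue) {
        # Poor man's priority queue: lowest doc_num first.
        @queue = sort { $a->[0][0] <=> $b->[0][0] } @queue;
        my ( $posting, $iter ) = @{ shift @queue };

        if ( @merged && $merged[-1]{doc_num} == $posting->[0] ) {
            $merged[-1]{freq} += $posting->[1];
        }
        else {
            push @merged,
                { doc_num => $posting->[0], freq => $posting->[1] };
        }

        # Refill the queue from the iterator we just consumed.
        my $next = $iter->();
        push @queue, [ $next, $iter ] if defined $next;
    }
    return @merged;
}

my @merged = merge_postings(
    make_iter( [ 1, 2 ], [ 4, 1 ], [ 7, 3 ] ),
    make_iter( [ 4, 5 ], [ 9, 1 ] ),
);
printf "doc %d: freq %d\n", $_->{doc_num}, $_->{freq} for @merged;
```

In this sketch, doc 4 appears in both inputs and comes out once with
its freqs added together, so the length of @merged is the number of
distinct documents.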

If you snoop the code for ORScorer (and its support class  
ScorerDocQueue), though, you'll see that it's a little involved.  It  
would be overkill to add a large, complex CompositePostingList class  
to KS right now, in order to avoid short-term code duplication.

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/



