[KinoSearch] more abstract interfaces to kinosearch

Marvin Humphrey marvin at rectangular.com
Mon Jul 2 13:38:14 PDT 2007


On Jul 2, 2007, at 12:32 PM, Nathan Kurz wrote:

> To my mind, the main disadvantage of filters is that they happen too
> late, after all the rest of the query has already run.

Interesting point.  I suppose we added a filter_bits argument to  
Scorer_Collect, we could apply the filter earlier.  Roughly, replace  
this...

    do {
      ...
    } while (Scorer_Next(self));

... with this:

    do {
      ...
    } while (Scorer_Skip_To( self, BitVec_Next_Set_Bit(filter_bits) ));

The downside is that it takes the filtering logic out of HitCollector  
and puts it in Scorer.  HitCollector is sparse; Scorer has a lot more  
going on.

> Contrast this with an ANDScorer with ordered subclauses
> 'category:music' and 'text:"the-the"'.  Checking whether the category
> field contains the term 'music' is really fast and efficient, and if
> the music category is relatively small we only have to do the
> expensive part of the query on a very small subset of the documents.

The advantage of implementing this with a QueryFilter is that instead  
of combining the results from two PostingList objects, you're  
combining a PostingList with a BitVector.

PostingList objects are iterators that refill from disk; BitVectors  
are memory only.  Few objects refilling from disk means fewer disk  
seeks means better performance.

PList_Skip_To helps to optimize the case where you have one common  
term and one rare term -- the larger PostingList will spend most of  
its time skipping, which is cheaper than calling PList_Next over and  
over.  But it's still probably faster to have fewer PostingLists and  
more BitVectors.

>
> Sure, you say, but what about restricted to ranges?  Surely you need a
> filter for that? Currently, but only because it's hard to write a
> custom scorer.  I don't see any inherent limitation that would prevent
> a custom scorer from dealing with ranges directly. And if one wanted,
> I don't see any reason that its Next doc method has to use an index at
> all --- it could work from a cached bit vector just as well.

Yeah, that's true.  :)  A TermScorer just iterates over a  
PostingList, spitting back document numbers (and derived scores).   
You could just as easily feed it from a bitvector using  
BitVec_Next_Set_Bit (which is in 0.15 but got zapped as unnecessary  
in 0.20).  Such a scorer would have to either supply a fixed score or  
no score, since the BitVector would only store document number info.

> are there things that can currently done with the Filter apparatus  
> that
> couldn't be done with a Scorer as or more efficiently?

Filters are going to be faster in some cases because of the caching.   
However, BitVectors don't work well remotely, since they're too big  
to pass around -- that's why SearchClient doesn't support filtering.   
So Filters have their drawbacks, too.

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/





More information about the KinoSearch mailing list