[KinoSearch] more abstract interfaces to kinosearch
Marvin Humphrey
marvin at rectangular.com
Mon Jul 2 13:38:14 PDT 2007
On Jul 2, 2007, at 12:32 PM, Nathan Kurz wrote:
> To my mind, the main disadvantage of filters is that they happen too
> late, after all the rest of the query has already run.
Interesting point. I suppose we added a filter_bits argument to
Scorer_Collect, we could apply the filter earlier. Roughly, replace
this...
do {
...
} while (Scorer_Next(self));
... with this:
do {
...
} while (Scorer_Skip_To( self, BitVec_Next_Set_Bit(filter_bits) ));
The downside is that it takes the filtering logic out of HitCollector
and puts it in Scorer. HitCollector is sparse; Scorer has a lot more
going on.
> Contrast this with an ANDScorer with ordered subclauses
> 'category:music' and 'text:"the-the"'. Checking whether the category
> field contains the term 'music' is really fast and efficient, and if
> the music category is relatively small we only have to do the
> expensive part of the query on a very small subset of the documents.
The advantage of implementing this with a QueryFilter is that instead
of combining the results from two PostingList objects, you're
combining a PostingList with a BitVector.
PostingList objects are iterators that refill from disk; BitVectors
are memory only. Few objects refilling from disk means fewer disk
seeks means better performance.
PList_Skip_To helps to optimize the case where you have one common
term and one rare term -- the larger PostingList will spend most of
its time skipping, which is cheaper than calling PList_Next over and
over. But it's still probably faster to have fewer PostingLists and
more BitVectors.
>
> Sure, you say, but what about restricted to ranges? Surely you need a
> filter for that? Currently, but only because it's hard to write a
> custom scorer. I don't see any inherent limitation that would prevent
> a custom scorer from dealing with ranges directly. And if one wanted,
> I don't see any reason that its Next doc method has to use an index at
> all --- it could work from a cached bit vector just as well.
Yeah, that's true. :) A TermScorer just iterates over a
PostingList, spitting back document numbers (and derived scores).
You could just as easily feed it from a bitvector using
BitVec_Next_Set_Bit (which is in 0.15 but got zapped as unnecessary
in 0.20). Such a scorer would have to either supply a fixed score or
no score, since the BitVector would only store document number info.
> are there things that can currently done with the Filter apparatus
> that
> couldn't be done with a Scorer as or more efficiently?
Filters are going to be faster in some cases because of the caching.
However, BitVectors don't work well remotely, since they're too big
to pass around -- that's why SearchClient doesn't support filtering.
So Filters have their drawbacks, too.
Marvin Humphrey
Rectangular Research
http://www.rectangular.com/
More information about the KinoSearch
mailing list