[KinoSearch] filtering and tallying

Marvin Humphrey marvin at rectangular.com
Tue Jul 10 01:20:39 PDT 2007


On Jul 3, 2007, at 2:24 PM, Nathan Kurz wrote:

> On 7/2/07, Marvin Humphrey <marvin at rectangular.com> wrote:
>>     do {
>>       ...
>>     } while (Scorer_Skip_To( self, BitVec_Next_Set_Bit 
>> (filter_bits) ));
>
> I was thinking more along the lines of something that would fit into
> the search infrastructure.  One could create a FilterScorer that would
> wrap an existing scorer. Something like (obviously not exact):
>
> $boolean_scorer = $query->scorer;
> $filter_scorer = new FilterScorer($boolean_scorer, @filters);
> $filter_scorer->collect();
>
> While advancing through documents, FilterScorer works like an
> AndScorer, but for scoring it only calls Tally on the internal
> BooleanScorer.  If one wanted, the actual filters could be
> BitVectorScorers that would use a cached BitVector rather than reading
> in Postings.

Sounds like an idea with potential.  I'm not sure you should expect  
huge gains over the existing Filter implementation, but I'd be happy  
to be surprised.

> how would a Scorer optimized for matching alone
> be different?  The main difference is that wouldn't have any need for
> Tally to be called.

There's that.  Another big gain would come from a custom Posting type.

   MatchPosting: <doc_delta>+
   ScorePosting: <doc_delta, freq, shared_boost, <position>+ >+
   RichPosting:  <doc_delta, freq, <position, boost>+ >+

MatchPosting files have less data in them than ScorePosting files.   
(: Or they would if they were actually done. :)

ScorePosting files have less data in them than RichPosting files.

Anyone who wants a brief definition of posting, see...

http://www.rectangular.com/kinosearch/docs/devel/KinoSearch/Docs/ 
IRTheory.html

See these pages for the initial brainstorming sessions:

   http://wiki.apache.org/lucene-java/FlexibleIndexing
   http://wiki.apache.org/lucene-java/ 
ConversationsBetweenDougMarvinAndGrant

The implementation is slightly different in KS than was envisioned  
for Lucene.  In Lucene, posting information for different fields  
resides in shared files.  In KS, postings files are broken up by  
field -- so there's only one Posting format per file.  Also, there's  
no Posting class in Lucene, and the Lucene version of PostingList is  
TermDocs, which isn't quite the same.  In Lucene, you subclass  
TermDocs to get more data per document.  In KS, you subclass Posting,  
but you still use a PostingList.

Eventually, the idea is that people like yourself will create custom  
Posting subclasses to supply custom Scorer subclasses with exactly  
the data they need and no more.

> I think the only problem with using the existing
> Scorers for matching is that they do a lot of unnecessary Tallying if
> the score is thrown away.
>
> So maybe rather than having Tally called as  part of the
> $scorer->collect() loop, it could be optionally called by the
> HitCollector.  A TopDocCollector would call Tally and use the score to
> order hits, but a SortCollector wouldn't have to.

Sure.

> HitCollector would
> be passed a Scorer instead of just a score.

FWIW, Tally doesn't exist in Lucene, which only uses scorer.doc() and  
scorer.score().  Tally is supposed to be the vessel which carries  
"your score and more!".  :)
>
> This was a late night idea, but it still seems to make sense to me  
> this morning.
> Does it make sense to you?
>
>> PostingList objects are iterators that refill from disk; BitVectors
>> are memory only.  Few objects refilling from disk means fewer disk
>> seeks means better performance.
>
> My intuition is that BitVectors are going to be a win only in very
> specific situations.  If you can fit your entire index in RAM, the
> system page cache is going to treat you well after the first access,
> no BitVector needed. If your index is _much_ larger than RAM, the
> BitVector is going to be unwieldily large too.  If your index is a
> size comparable to RAM, even at a 1:32 ratio your (multiple) cached
> BitVectors are going to be displacing index that would otherwise have
> been cached.
>
> I wonder if one would get more bang by optimizing the index structure
> (as you are already doing) than by resorting to BitVectors.   My
> personal impulse is to figure out how the indexes could be mmap'ed in
> and used directly, thus saving the processing time in reconstituting
> them and allowing for better shared resources across multiple search
> threads/processes.  My guess is that this could be a win over even the
> BitVectors.

Possible.  We'll just need to benchmark it.

>> However, BitVectors don't work well remotely, since they're too big
>> to pass around -- that's why SearchClient doesn't support filtering.
>
> This does strike me as a major disadvantage.  Almost everything else
> I've looked at in KinoSearch seems like it scales quite nicely to
> multiple machines.
> It would be nice to enable someone to use this for a new Google :).

Yes.  Google's search results are the the sum of many small indexes,  
and in another sense, the sum of many scoring algorithms.  The KS  
infrastructure should scale up so that at least large specialized  
search engines are possible.  For the scoring algorithms, we'll have  
to work together as a community.  :)

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/





More information about the KinoSearch mailing list