[KinoSearch] Queries with large number of hits.

Marvin Humphrey marvin at rectangular.com
Thu Sep 18 21:25:52 PDT 2008




On Sep 17, 2008, at 1:16 PM, Nathan Kurz wrote:

> But since the processing time is
> probably close to proportional to the file size, maybe this is where
> Lucene has the advantage.

I've now performed the experiment pitting Lucene's TermQuery against  
KinoSearch's.  The results were worse than expected, and extra  
positions decoding overhead seem to be a bigger factor than I'd  
thought it would be.  For a somewhat common term ("Reuters" in the  
Reuters corpus), KS is 3x slower.  For a very common term ("the"),  
it's around 6x slower, and that's worse than it sounds because the 6x  
factor is multiplying a number that's big to begin with.

Proliferating positions account for the non-constant slowdown: "the"  
occurs at many positions, and thus has greater per-document scanning  
cost in comparison to "Reuters", which most likely appears once per  
document.

> Re-indexing KinoSearch without positions and re-running your previous
> searches would also be an inverse way to test this hypothesis.


The way to do this in KS is to override FieldSpec->posting() in a  
subclass so that it returns a MatchPosting instead of a ScorePosting.   
MatchPosting was a placeholder until today, but now a provisional  
implementation has been completed for testing purposes.

Happily, with MatchPosting, we get much closer to Lucene performance:  
around 1.3x for "Reuters" and around 1.7x for "the".   
MatchPostingScorer doesn't presently take document/field boost into  
account, so it's doing a little less work than Lucene's TermScorer,  
but nevertheless, the results are instructive.

Since we're still slower than Lucene even with MatchPosting, we may  
want to work on optimizing TermScorer.  Probably we need some sort of  
bulk read -- a la what Lucene does, see below.  We may also want to  
restructure things so that we're reading at the level of PostingList  
rather than Posting to cut down on method call overhead from nesting.   
Right now, Lucene's TermScorer.next() calls TermDocs.read(int[],int[])  
once every 20 calls.  In contrast, KinoSearch's TermScorer_Next calls  
PList_Next() and PList_Get_Posting() every time.

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/

   /** Advances to the next document matching the query.
    * <br>The iterator over the matching documents is buffered using
    * {@link TermDocs#read(int[],int[])}.
    * @return true iff there is another document matching the query.
    */
   public boolean next() throws IOException {
     pointer++;
     if (pointer >= pointerMax) {
       pointerMax = termDocs.read(docs, freqs);    // refill buffer
       if (pointerMax != 0) {
         pointer = 0;
       } else {
         termDocs.close();                         // close stream
         doc = Integer.MAX_VALUE;                  // set to sentinel  
value
         return false;
       }
     }
     doc = docs[pointer];
     return true;
   }


_______________________________________________
KinoSearch mailing list
KinoSearch at rectangular.com
http://www.rectangular.com/mailman/listinfo/kinosearch




More information about the kinosearch mailing list