[KinoSearch] Queries with large number of hits.
Marvin Humphrey
marvin at rectangular.com
Thu Sep 18 21:25:52 PDT 2008
On Sep 17, 2008, at 1:16 PM, Nathan Kurz wrote:
> But since the processing time is
> probably close to proportional to the file size, maybe this is where
> Lucene has the advantage.
I've now performed the experiment pitting Lucene's TermQuery against
KinoSearch's. The results were worse than expected, and extra
positions decoding overhead seem to be a bigger factor than I'd
thought it would be. For a somewhat common term ("Reuters" in the
Reuters corpus), KS is 3x slower. For a very common term ("the"),
it's around 6x slower, and that's worse than it sounds because the 6x
factor is multiplying a number that's big to begin with.
Proliferating positions account for the non-constant slowdown: "the"
occurs at many positions, and thus has greater per-document scanning
cost in comparison to "Reuters", which most likely appears once per
document.
> Re-indexing KinoSearch without positions and re-running your previous
> searches would also be an inverse way to test this hypothesis.
The way to do this in KS is to override FieldSpec->posting() in a
subclass so that it returns a MatchPosting instead of a ScorePosting.
MatchPosting was a placeholder until today, but now a provisional
implementation has been completed for testing purposes.
Happily, with MatchPosting, we get much closer to Lucene performance:
around 1.3x for "Reuters" and around 1.7x for "the".
MatchPostingScorer doesn't presently take document/field boost into
account, so it's doing a little less work than Lucene's TermScorer,
but nevertheless, the results are instructive.
Since we're still slower than Lucene even with MatchPosting, we may
want to work on optimizing TermScorer. Probably we need some sort of
bulk read -- a la what Lucene does, see below. We may also want to
restructure things so that we're reading at the level of PostingList
rather than Posting to cut down on method call overhead from nesting.
Right now, Lucene's TermScorer.next() calls TermDocs.read(int[],int[])
once every 20 calls. In contrast, KinoSearch's TermScorer_Next calls
PList_Next() and PList_Get_Posting() every time.
Marvin Humphrey
Rectangular Research
http://www.rectangular.com/
/** Advances to the next document matching the query.
* <br>The iterator over the matching documents is buffered using
* {@link TermDocs#read(int[],int[])}.
* @return true iff there is another document matching the query.
*/
public boolean next() throws IOException {
pointer++;
if (pointer >= pointerMax) {
pointerMax = termDocs.read(docs, freqs); // refill buffer
if (pointerMax != 0) {
pointer = 0;
} else {
termDocs.close(); // close stream
doc = Integer.MAX_VALUE; // set to sentinel
value
return false;
}
}
doc = docs[pointer];
return true;
}
_______________________________________________
KinoSearch mailing list
KinoSearch at rectangular.com
http://www.rectangular.com/mailman/listinfo/kinosearch
More information about the kinosearch
mailing list