[KinoSearch] Queries with large number of hits.
Nathan Kurz
nate at verse.com
Fri Sep 19 11:25:53 PDT 2008
On Thu, Sep 18, 2008 at 10:25 PM, Marvin Humphrey
<marvin at rectangular.com> wrote:
> I've now performed the experiment pitting Lucene's TermQuery against
> KinoSearch's. The results were worse than expected, and extra positions
> decoding overhead seem to be a bigger factor than I'd thought it would be.
Thanks for your quick research and improvements, Marvin. You prompted
me to run a few quick tests as well:
kinosearch_instream/perl$ search "the OR and OR of OR for" (ten times)
kinosearch_instream/perl$ opreport -alt2 */KinoSearch.so
CPU: CPU with timer interrupt, speed 0 MHz (estimated)
Profiling through timer interrupt
samples cum. samples % cum. % symbol name
190 190 23.8394 23.8394 kino_InStream_read_c32
112 302 14.0527 37.8921 kino_ScorePost_read_record
68 370 8.5320 46.4241 kino_ScorerDocQ_top_next
55 425 6.9009 53.3250 kino_TermScorer_next
51 476 6.3990 59.7240 kino_InStream_read_u8
45 521 5.6462 65.3701 kino_SegPList_next
31 552 3.8896 69.2597 __i686.get_pc_thunk.bx
30 582 3.7641 73.0238 advance_after_current
29 611 3.6386 76.6625 anonymous symbol from section .plt
25 636 3.1368 79.7992 kino_MemMan_wrapped_realloc
22 658 2.7604 82.5596 kino_ScorePostScorer_tally
17 675 2.1330 84.6926 kino_ORScorer_tally
"opannotate --source */KinoSearch.so" is also useful to glance at, as
is adding the '-c' flag to opreport to see where the calls are coming
from and where the functions are spending their time internally.
The main thing that jumped out is that the function call to
Instream_read_c32 is killing us. I don't see any way to have this
remain a per-int function call and still get good performance. You
need to figure out some to decode the entire posting with fewer
function calls, and this is where the bulk of them are coming from.
I'd suggest having the Store level return an undecoded raw posting,
and let the Posting class decode the parts it wants. That way the
VByte code can be macros that work on a local buffer in a tight loop.
I'm sure there are other ways to do it, though.
The second thing that jumps out is that decompressing VBytes is
expensive. P4Delta might be a significant advantage, or perhaps there
are ways to optimize the decompression with the existing scheme. I
toyed a bit with trying to come up with a branchless way of doing it,
but gave up without much to show for it.
The third thing (tiny, but perhaps easy to fix) is that
Scorepost_read_record is spending 40% of its time in REALLOC. Is the
enlarged position buffer not getting reused for some reason?
The last thing (sort of a non-thing) is that to my surprise the
double/triple buffering of the Posting data doesn't seem to have much
of a negative effect. I still think it's worth trying to avoid this,
but it's barely on the current radar.
> The way to do this in KS is to override FieldSpec->posting() in a subclass
> so that it returns a MatchPosting instead of a ScorePosting. MatchPosting
> was a placeholder until today, but now a provisional implementation has been
> completed for testing purposes.
> Happily, with MatchPosting, we get much closer to Lucene performance: around
> 1.3x for "Reuters" and around 1.7x for "the".
Did you also recreate an index without the position information, or
are these times based on just skipping over the position info in the
index? Times would probably be closer if the information was
out-of-line.
Nathan Kurz
nate at verse.com
ps. The directions for building the Reuters benchmark index seem out
of date. '-Mblib' no longer finds the uninstalled KinoSearch.so in
the parent hierarchy.
_______________________________________________
KinoSearch mailing list
KinoSearch at rectangular.com
http://www.rectangular.com/mailman/listinfo/kinosearch
More information about the kinosearch
mailing list