[KinoSearch] Queries with large number of hits.
Marvin Humphrey
marvin at rectangular.com
Tue Sep 16 23:36:20 PDT 2008
On Sep 14, 2008, at 10:02 PM, Nathan Kurz wrote:
> Taking Dan's tests at face value, for the moment, I'm not quite
> understanding how the issues you are pointing at would affect speed
> this much.
I don't think addressing those items would have that level of impact,
either. It's really, really easy to screw up these kinds of
comparative benchmarks, though. Before I published the indexing
benchmarks, I submitted the Lucene app to the lucene dev list for
critiquing and even after all the grilling it got there we STILL
missed a crucial bug in it.
That said, it wouldn't surprise me if current Lucene search-time
performance exceeded that of KS trunk at least until the issues I
listed are addressed -- I just don't know by how much.
> It seems like his chosen terms can't be occurring so many
> times per document that the extra position decoding could be this
> significant.
The extra positional decoding is probably big enough to think about.
No way it could account for a fourfold discrepancy though. More like
5% - 20%.
> Is the Lucene position data kept in a separate stream?
Exactly.
The dev branch of KinoSearch implements the "flexible indexing" model
described at <http://wiki.apache.org/lucene-java/FlexibleIndexing>,
where doc number, frequency, positions, and boost all reside in one
unified file (per field). In contrast, each Lucene segment has...
* One .frq file which holds document number and term frequency info.
* One .prx file which holds positions data.
* One file per field holding boost data. These files are lazily
slurped into RAM as soon as they are needed and cached for the life
of the IndexReader.
We knew about the extra-positions-overhead problem from the start, but
we figured it would be enough if we gave people the option of
disabling positions on a per-field basis. My take now, having since
put flexible indexing into practice, is that ad-hoc disabling is not a
practical approach. You need multiple streams.
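To make the trade-off concrete, here is a minimal sketch in Python. It
assumes a deliberately simplified encoding (variable-length integers,
delta-coded doc numbers and positions) that is NOT the actual KinoSearch
or Lucene file format, and every function name in it is hypothetical.
It writes the same postings once as a single unified stream and once as
separate frequency and position streams, then counts how many integers
a simple term query has to decode when it only needs doc number and
frequency.

```python
# Illustrative only: a toy posting encoding, not the real KinoSearch
# or Lucene on-disk format. All names here are hypothetical.

def encode_vbyte(n, out):
    """Append n to out as a variable-length integer, 7 bits per byte."""
    while n >= 0x80:
        out.append((n & 0x7F) | 0x80)
        n >>= 7
    out.append(n)

def decode_vbyte(buf, pos):
    """Decode one variable-length integer; return (value, next_pos)."""
    value = 0
    shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        value |= (byte & 0x7F) << shift
        if byte < 0x80:
            return value, pos
        shift += 7

def write_unified(postings):
    """One stream: doc delta, freq, then position deltas, per document."""
    out = bytearray()
    last_doc = 0
    for doc, positions in postings:
        encode_vbyte(doc - last_doc, out)
        encode_vbyte(len(positions), out)
        last_pos = 0
        for p in positions:
            encode_vbyte(p - last_pos, out)
            last_pos = p
        last_doc = doc
    return out

def write_split(postings):
    """Two streams: (doc delta, freq) in one, position deltas in the other."""
    frq, prx = bytearray(), bytearray()
    last_doc = 0
    for doc, positions in postings:
        encode_vbyte(doc - last_doc, frq)
        encode_vbyte(len(positions), frq)
        last_pos = 0
        for p in positions:
            encode_vbyte(p - last_pos, prx)
            last_pos = p
        last_doc = doc
    return frq, prx

def scan_unified(buf):
    """Collect (doc, freq) pairs; positions must be decoded to get past them."""
    pos, doc, decoded, hits = 0, 0, 0, []
    while pos < len(buf):
        delta, pos = decode_vbyte(buf, pos)
        freq, pos = decode_vbyte(buf, pos)
        decoded += 2
        doc += delta
        for _ in range(freq):  # wasted work for a plain term query
            _, pos = decode_vbyte(buf, pos)
            decoded += 1
        hits.append((doc, freq))
    return hits, decoded

def scan_frq(frq):
    """Collect (doc, freq) pairs without ever touching the position stream."""
    pos, doc, decoded, hits = 0, 0, 0, []
    while pos < len(frq):
        delta, pos = decode_vbyte(frq, pos)
        freq, pos = decode_vbyte(frq, pos)
        decoded += 2
        doc += delta
        hits.append((doc, freq))
    return hits, decoded

# Sample postings for one term: (doc number, sorted positions).
POSTINGS = [(3, [1, 7, 200]), (10, [5]), (500, [0, 1000, 70000])]
```

Calling scan_unified(write_unified(POSTINGS)) and
scan_frq(write_split(POSTINGS)[0]) yields the same hits, but the
unified scan decodes one extra integer per term occurrence. The
overhead therefore grows with term frequency, which is why it matters
more for common terms in long documents than for rare ones.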
> If creating a real benchmark (a good idea) seems too difficult,
We can get started with a benchmark for simple term queries against
the Reuters corpus.
Marvin Humphrey
Rectangular Research
http://www.rectangular.com/