[KinoSearch] Merging indexes, etc
Marvin Humphrey
marvin at rectangular.com
Fri Oct 20 11:11:15 PDT 2006
>> Adding skipTo() to SegTermDocs and MultiTermDocs is easy, and should
>> yield some improvement of speed on phrase queries right away.
>> However, more significant benefits will accrue when I port Lucene's
>> BooleanScorer2 and its dependencies. That's more work.
>
> Hmm. Are you able to quantify (intuitively is fine) what kind of
> improvements one can expect in search performance for keywords in
> general, and phrases specifically?
Single keyword search-times would not be affected. The trick of
skipTo() is that it eliminates possibilities when multiple queries
are combined.
For phrase queries and complex boolean queries... it's really hard to
say. It saves CPU and doesn't really affect disk I/O. But for
something like the example I supplied where there's a required term
which is quite rare, the CPU savings can be significant.
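To make the rare-required-term case concrete, here is a minimal sketch in Python rather than KinoSearch's own code. The `PostingList` class and `intersect` function are hypothetical stand-ins for a TermDocs-style iterator, and the binary search stands in for whatever skip data a real index would store. The only behavior assumed is that `skip_to(target)` advances to the first document number greater than or equal to `target`.

```python
from bisect import bisect_left


class PostingList:
    """Sorted document numbers for one term (hypothetical stand-in
    for a TermDocs-style iterator)."""

    def __init__(self, doc_ids):
        self.doc_ids = sorted(doc_ids)
        self.pos = -1  # positioned before the first entry

    def doc(self):
        return self.doc_ids[self.pos]

    def next(self):
        """Advance by one entry; return False when exhausted."""
        self.pos += 1
        return self.pos < len(self.doc_ids)

    def skip_to(self, target):
        """Advance to the first entry >= target without moving backward.
        Binary search is an assumption here, used in place of on-disk
        skip data."""
        self.pos = bisect_left(self.doc_ids, target, max(self.pos, 0))
        return self.pos < len(self.doc_ids)


def intersect(rare, common):
    """Return documents containing both terms, driven by the rare term."""
    hits = []
    while rare.next():
        # Jump the common term's list straight to the candidate document
        # instead of stepping through every entry in between.
        if not common.skip_to(rare.doc()):
            break
        if common.doc() == rare.doc():
            hits.append(rare.doc())
    return hits


rare_term = PostingList([5, 900, 4000])
common_term = PostingList(range(0, 10000, 3))
print(intersect(rare_term, common_term))
```

With three entries in the rare list, the common list is probed three times rather than walked entry by entry, which is where the CPU savings would come from.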
>> At some point, index size becomes too great for any one machine to
>> handle gracefully. What needs to happen then is for documents to be
>> distributed around several machines on several indexes -- so you no
>> longer have one monolithic index. Each machine then searches against
>> its smaller index, the results are pooled, and there's something like
>> a runoff election to determine which documents get returned.
>>
>> KinoSearch does not yet have the infrastructure to support this, but
>> the design is out there and just needs to be implemented.
>
> This sounds interesting. Hazard a guess: how long to implement this
> concept?
A couple weeks? Lucene has RemoteSearchable, MultiSearcher, and
ParallelMultiSearcher, which implement this model, more or less --
there are several TODO comments in ParallelMultiSearcher. Porting
those classes might be straightforward, or it might not be. I
haven't thought too hard about how to handle the inter-machine
communication or studied this problem in depth. It wasn't originally
a major concern of mine, because KS didn't need to scale that big to
do what I was trying to do. Now, though, I can see the potential,
and distributing search over several machines is something I'm very
interested in.
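As a rough illustration of the pooling step, here is a minimal single-process sketch in Python. It is not a port of the Lucene classes named above: `search_shard` and `pooled_search` are hypothetical names, each "shard" is just a dict mapping a document id to per-term scores, and the inter-machine communication is left out entirely. It also assumes that scores from different shards are directly comparable.

```python
import heapq


def search_shard(index, query_term, top_n):
    """Return the top_n (score, doc_id) pairs from one shard.
    `index` maps doc_id -> {term: score}; a hypothetical structure."""
    hits = [
        (scores[query_term], doc_id)
        for doc_id, scores in index.items()
        if query_term in scores
    ]
    return heapq.nlargest(top_n, hits)


def pooled_search(shards, query_term, top_n):
    """Query every shard, pool the per-shard winners, and keep the
    overall top_n: the 'runoff election' among local results."""
    pooled = []
    for shard in shards:
        pooled.extend(search_shard(shard, query_term, top_n))
    return [doc_id for score, doc_id in heapq.nlargest(top_n, pooled)]


shards = [
    {"a1": {"perl": 0.9}, "a2": {"perl": 0.2}},
    {"b1": {"perl": 0.7, "lucene": 0.5}, "b2": {"lucene": 0.4}},
]
print(pooled_search(shards, "perl", 2))
```

Each shard only has to return its own top_n, since no document outside a shard's local top_n can appear in the global top_n.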
To be frank, I always figured this was the kind of feature someone
would eventually subsidize. If not with KinoSearch, then with Lucy.
That's one reason why I haven't prioritized it. The other is that I
have a vision for KinoSearch 0.20 and every moment that it isn't done
drives me crazy. That vision includes sorting, RangeFilter, rich
positions and a simpler, more elegant class structure and file
format. All of those have to be done at once. Distributed search
gets overlaid over the top, so it can be added at any time.
Marvin Humphrey
Rectangular Research
http://www.rectangular.com/