[KinoSearch] Merging indexes, etc

Marvin Humphrey marvin at rectangular.com
Fri Oct 20 11:11:15 PDT 2006


>> Adding skipTo() to SegTermDocs and MultiTermDocs is easy, and should
>> yield some improvement of speed on phrase queries right away.
>> However, more significant benefits will accrue when I port Lucene's
>> BooleanScorer2 and its dependencies.  That's more work.
>
> Hmm.  Are you able to quantify (intuitively is fine) what kind of
> improvements one can expect in search performance for keywords in
> general, and phrases specifically?

Single-keyword search times would not be affected.  The trick of  
skipTo() is that it lets a scorer jump past documents that can't  
possibly match when multiple terms or subqueries are combined.

For phrase queries and complex boolean queries... it's really hard to  
say.  It saves CPU and doesn't really affect disk i/o.  But for  
something like the example I supplied where there's a required term  
which is quite rare, the CPU savings can be significant.
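
A minimal sketch of why a rare required term helps, in Python rather than KinoSearch's own code.  PostingList, skip_to() and intersect() are hypothetical names for illustration, and the binary search stands in for whatever skip data a real index stores:

```python
from bisect import bisect_left


class PostingList:
    """Sorted doc ids for one term (hypothetical stand-in for a TermDocs object)."""

    def __init__(self, doc_ids):
        self.doc_ids = sorted(doc_ids)
        self.pos = 0

    def doc(self):
        """Return the current doc id, or None once the list is exhausted."""
        return self.doc_ids[self.pos] if self.pos < len(self.doc_ids) else None

    def skip_to(self, target):
        """Advance to the first doc id >= target without visiting the ones in between."""
        # Assumption: binary search replaces the skip data of a real index.
        self.pos = bisect_left(self.doc_ids, target, self.pos)
        return self.doc()


def intersect(rare, common):
    """Return doc ids present in both lists, driven by the rarer list."""
    hits = []
    for doc_id in rare.doc_ids:
        # One skip per rare doc, instead of stepping through every common doc.
        if common.skip_to(doc_id) == doc_id:
            hits.append(doc_id)
    return hits


rare = PostingList([7, 5000, 90000])
common = PostingList(range(0, 100000, 2))
print(intersect(rare, common))
```

The common term has 50,000 postings here, but the loop only performs three skips, one per document containing the rare term.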

>> At some point, index size becomes too great for any one machine to
>> handle gracefully.   What needs to happen then is for documents to be
>> distributed around several machines on several indexes -- so you no
>> longer have one monolithic index.  Each machine then searches against
>> its smaller index, the results are pooled, and there's something like
>> a runoff election to determine which documents get returned.
>>
>> KinoSearch does not yet have the infrastructure to support this, but
>> the design is out there and just needs to be implemented.
>
> This sounds interesting.  Hazard a guess:  how long to implement this
> concept?

A couple weeks?  Lucene has RemoteSearchable, MultiSearcher, and  
ParallelMultiSearcher, which implement this model, more or less --  
there are several TODO comments in ParallelMultiSearcher.  Porting  
those classes might be straightforward, or it might not be.  I  
haven't thought too hard about how to handle the inter-machine  
communication or studied this problem in depth.  It wasn't originally  
a major concern of mine, because KS didn't need to scale that big to  
do what I was trying to do.  Now, though, I can see the potential,  
and distributing search over several machines is something I'm very  
interested in.
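
For the pooling step, a rough sketch in Python.  This is not how Lucene's MultiSearcher is written; merged_top_k() is a hypothetical name, each shard is just a list of (score, doc_id) hits, and the sketch assumes scores are comparable across machines, which a real system would have to arrange:

```python
import heapq


def merged_top_k(shard_hits, k):
    """Pool each shard's local top k and keep the overall best k.

    shard_hits: one list of (score, doc_id) tuples per machine (assumed shape).
    """
    pooled = []
    for hits in shard_hits:
        # Each machine only needs to send its own k best hits.
        pooled.extend(heapq.nlargest(k, hits))
    # The "runoff": best k across everything that was pooled.
    return heapq.nlargest(k, pooled)


shard_a = [(0.9, "a1"), (0.4, "a2"), (0.2, "a3")]
shard_b = [(0.7, "b1"), (0.6, "b2")]
print(merged_top_k([shard_a, shard_b], 3))
```

No machine has to ship more than k hits, since nothing outside a shard's local top k can make the global top k.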

To be frank, I always figured this was the kind of feature someone  
would eventually subsidize.  If not with KinoSearch, then with Lucy.   
That's one reason why I haven't prioritized it.  The other is that I  
have a vision for KinoSearch 0.20 and every moment that it isn't done  
drives me crazy.  That vision includes sorting, RangeFilter, rich  
positions and a simpler, more elegant class structure and file  
format.  All of those have to be done at once.  Distributed search  
gets layered on top, so it can be added at any time.

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/




