[KinoSearch] Distributed indexes

Marvin Humphrey marvin at rectangular.com
Tue Aug 29 19:08:54 PDT 2006




On Aug 28, 2006, at 4:04 PM, Simon Wistow wrote:

> We're currently looking at about 10 Tb of data

FYI, I don't know of anyone who's built a KS-based search engine on  
that scale.  I'd really like to see somebody make the attempt, but I  
can't say what will happen.  ;)

If the search-time performance turns out to be too slow, eventually  
KS will have a cluster system available.  Multiple machines would  
need to communicate with each other prior to the main search  
processing: each term needs to get an IDF in order for scoring to  
proceed, and you have to know the total size of the collection and  
the number of docs the term appears in to calculate it.  If a search
requested 10 docs, each machine would produce its own top 10, and the
10 highest scorers out of that aggregate pool would be your final search
results.  If you want to see how this can work, take a look at  
Lucene's MultiSearcher class.
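
To make the two-phase exchange concrete, here's a rough sketch in
plain Perl -- made-up shard stats and a Lucene-style IDF formula, not
the eventual KS cluster API:

    use List::Util qw( sum );

    # Hypothetical per-shard stats for one query term.  In a real
    # cluster, each node would report these before scoring begins.
    my @shards = (
        {   num_docs => 1_000_000,
            doc_freq => 12_000,
            top_hits => [ [ 'doc3', 4.1 ], [ 'doc9', 3.7 ] ],
        },
        {   num_docs => 800_000,
            doc_freq => 9_500,
            top_hits => [ [ 'doc2', 4.4 ], [ 'doc7', 2.9 ] ],
        },
    );

    # Phase 1: pool collection-wide stats so every node scores with
    # the same IDF.
    my $total_docs = sum map { $_->{num_docs} } @shards;
    my $total_df   = sum map { $_->{doc_freq} } @shards;
    my $idf        = 1 + log( $total_docs / ( $total_df + 1 ) );

    # Phase 2: each node scores locally using $idf and returns its own
    # top 10; the coordinator merges them and keeps the 10 best.
    my $num_wanted = 10;
    my @merged = sort { $b->[1] <=> $a->[1] }
                 map  { @{ $_->{top_hits} } } @shards;
    splice( @merged, $num_wanted ) if @merged > $num_wanted;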

Working up such a cluster system is not currently at the top of my
to-do list.  (Finding steady work is, followed by FieldCache, Lucy, etc...)

> to index with approx 16 new updates a second which we'd like to merge
> within about 40 seconds so any help would be gratefully appreciated.

The only way you're going to be able to manage this is if a new  
option gets added to finish(), and even then it will be challenging.   
This would also hold true for Lucene, because the problem is the  
same: inverted index data structures are optimized for fast  
searching, not instant updates.
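
Just to put numbers on that spec: 16 updates a second works out to
roughly 640 docs per 40-second window and well over a million new
docs a day.

    my $updates_per_sec = 16;
    my $per_window      = $updates_per_sec * 40;            # ~640 docs per merge window
    my $per_day         = $updates_per_sec * 60 * 60 * 24;  # 1,382,400 new docs a day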

Merging of segments takes an unpredictable amount of time.  Usually  
it will be pretty quick, but every once in a while, when large  
segments get merged, it will take a while.

Furthermore, when you update an index, you need to open a new  
Searcher and warm up its caches.  That cost is much easier to  
predict, though, as the warm-up time tracks the size of the index  
closely.
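
The warm-up itself can be as simple as opening the new Searcher and
running a few representative queries against it before it takes live
traffic.  Something like this (paths and queries are placeholders):

    use KinoSearch::Searcher;
    use KinoSearch::Analysis::PolyAnalyzer;

    my $analyzer = KinoSearch::Analysis::PolyAnalyzer->new( language => 'en' );

    # Open a Searcher against the freshly updated invindex...
    my $searcher = KinoSearch::Searcher->new(
        invindex => '/path/to/invindex',
        analyzer => $analyzer,
    );

    # ... and warm its caches with a handful of representative queries
    # before swapping it in for the one serving live traffic.
    for my $q ( 'foo', 'bar baz' ) {
        my $hits = $searcher->search( query => $q );
        $hits->seek( 0, 10 );
    }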

Incremental indexing in KS/Lucene works something like this:

    Add a book, generate an index.

    Add another chapter, generate a second index.  Now you need to
      leaf through two indexes.

    Add another chapter, generate a third index.  Now you've got
      three indexes to look through each time you search.

    ...

Those indexes get merged periodically, and the interleaving process  
takes time proportional to the size of the segments being merged.
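
In code, each incremental pass might look something like this (paths
and field names are placeholders):

    use KinoSearch::InvIndexer;
    use KinoSearch::Analysis::PolyAnalyzer;

    my $analyzer = KinoSearch::Analysis::PolyAnalyzer->new( language => 'en' );

    # Open the existing invindex and add one batch of docs.  Each pass
    # like this leaves a new segment behind.
    my $invindexer = KinoSearch::InvIndexer->new(
        invindex => '/path/to/invindex',
        analyzer => $analyzer,
    );
    $invindexer->spec_field( name => 'title' );
    $invindexer->spec_field( name => 'bodytext' );

    my @new_chapters = ( { title => 'Chapter 2', text => '...' } );
    for my $chapter (@new_chapters) {
        my $doc = $invindexer->new_doc;
        $doc->set_value( title    => $chapter->{title} );
        $doc->set_value( bodytext => $chapter->{text} );
        $invindexer->add_doc($doc);
    }

    # finish() without optimize merges only what KS decides to merge.
    $invindexer->finish;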

The more segments you have, the longer it takes to search, though the
time still depends much more on the absolute size of the index than on
the number of segments.  Optimizing the index via...

    $invindexer->finish( optimize => 1 );

... reads all existing segments and dumps them into the one currently
being written, which is costly at index-time but produces a
single-segment index for fastest searching.  Ordinarily, KS tries to keep
segments around in something approximating a Fibonacci series (as an  
aside, this is different from Lucene) -- the idea being to minimize  
the number of merges, while still providing good search-time  
performance.
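
The practical upshot of that Fibonacci-ish policy is that the number
of live segments grows very slowly relative to the number of docs.  A
toy illustration -- just the arithmetic, not KS internals:

    # How many Fibonacci-sized segments does it take to hold a million
    # docs?  The answer grows roughly logarithmically with doc count.
    my $target = 1_000_000;
    my ( $prev, $curr ) = ( 1, 1 );
    my ( $docs, $segments ) = ( 0, 0 );
    while ( $docs < $target ) {
        $docs += $curr;
        $segments++;
        ( $prev, $curr ) = ( $curr, $prev + $curr );
    }
    print "$docs docs in $segments segments\n";   # about 28 segments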

What we would need to do is STOP KinoSearch from merging any existing  
segments into the segment currently in progress (that's the API  
that's missing).  Then every once in a while, you'd force-optimize  
the index -- probably off-line to a copy which gets swapped in when  
the process completes.  It's a reasonable strategy, but the number of  
updates in your spec is quite substantial.
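
The off-line pass itself could look something like this -- paths are
placeholders, and the "stop merging during normal updates" switch is
the piece that doesn't exist yet:

    use File::Path qw( rmtree );
    use KinoSearch::InvIndexer;
    use KinoSearch::Analysis::PolyAnalyzer;

    my $live    = '/var/index/live';
    my $scratch = '/var/index/optimizing';

    # Copy the live invindex aside so searches keep running off $live.
    system( 'cp', '-rp', $live, $scratch ) == 0
        or die "copy failed: $?";

    # Squash the copy down to a single segment.
    my $analyzer   = KinoSearch::Analysis::PolyAnalyzer->new( language => 'en' );
    my $invindexer = KinoSearch::InvIndexer->new(
        invindex => $scratch,
        analyzer => $analyzer,
    );
    $invindexer->finish( optimize => 1 );

    # Swap the optimized copy in; rename() is atomic within a filesystem.
    rename $live,    "$live.old" or die "rename: $!";
    rename $scratch, $live       or die "rename: $!";
    rmtree("$live.old");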

Cheers,

Marvin Humphrey

--
I'm looking for a part-time job.



