[KinoSearch] MultiSearcher's lack of features

Marvin Humphrey marvin at rectangular.com
Tue Jun 12 07:45:57 PDT 2007


On Jun 12, 2007, at 2:28 AM, Henka wrote:

>> Looking up those values is kind of expensive though.  Lexicons are
>> only partially kept in memory (1 out of every 128 terms), and with a
>> multi-segment index, you have to perform 1 disk scan per segment.
>> Say you have 10 hits and 25 segments.  That's 250 disk seeks to
>> associate each hit with a sort field value.  :(
>
> Very bad indeed.  That would potentially murder search times on a busy
> search cluster, right?  (IO being the bottleneck)

I think I've overstated the problem.

First... if we go to full caching of sort fields, it's a memory hit,  
but not an unmanageable one.  The full caching strategy is, in fact,  
EXACTLY how Lucene does things.  I'd hoped to improve on it, but  
maybe that won't be possible this time.

Second... best practice for a busy search cluster would be to  
optimize the indexes on most nodes, so that each index contains a  
single segment.  Then you're looking at 1 disk seek per hit, and  
they're all coming at the same time and are all concentrated in the  
same spot in the same file.  OS caching will help some, and the fact  
that the disk head wouldn't have to travel far will also make those  
seeks comparatively inexpensive.  Search performance will continue to  
be dominated by the time it takes to score large numbers of  
matches for common terms.  These lookups won't be a primary  
consideration.
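
To put rough numbers on that, here is a back-of-the-envelope sketch of the worst case described above: one lexicon scan per segment for each hit. The function name is made up for illustration, and the model ignores OS caching and seek locality, so treat the results as upper bounds rather than measurements.

```python
def sort_value_seeks(hits: int, segments: int) -> int:
    """Worst-case disk seeks needed to attach a sort value to every hit.

    Assumes one lexicon scan per segment for each hit, with no help
    from the OS cache.
    """
    return hits * segments


# The example from earlier in the thread: 10 hits, 25 segments.
multi_segment = sort_value_seeks(hits=10, segments=25)

# The same 10 hits against an optimized, single-segment index.
optimized = sort_value_seeks(hits=10, segments=1)

print(multi_segment, optimized)
```

Under this model, optimizing the index cuts the lookups from 250 to 10, which is why the segment count matters more than the hit count here.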

For busy search clusters that must keep indexes updated frequently,  
you should dedicate one machine to rapidly changing data, while all  
the rest handle older, stable data and stay optimized.  The  
rapid-update index will necessarily be multi-segment, but if you keep it  
small, the costs should be manageable.

I intend to write KinoSearch::Docs::CookBook::ScalingUp describing  
this architecture.  (: But not this week. :)  Best practice will  
remain the same no matter what system we adopt to handle the sort  
caching issue.

>> To avoid that cost, we might have to load entire lexicons for sort
>> fields into memory.  I've been trying to avoid that, but I don't see
>> how.
>
> Just so I understand:  when you say "load entire lexicons for sort  
> fields
> into memory" you mean the sort fields of the -search result set-,  
> right?

Say you are sorting by 'date'.

At present, we keep 1 out of every 128 values for the date field in  
memory -- the contents of the .lexx file (LEXicon indeX).  When we  
need to find a particular value, we look it up in this index, which  
tells us the general location on disk.  Then we scan a small portion  
of the full .lex file to find the exact term.

What we might do instead is load ALL 'date' values into memory (the  
full .lex file) -- then we wouldn't need to touch the disk again.   
The memory costs of doing this depend on how many unique dates you have.
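
The two strategies can be sketched in a few lines of Python. This is a toy model, not KinoSearch's actual code: the class and function names are hypothetical, a sorted in-memory list stands in for the on-disk .lex file, and every 128th entry plays the role of the .lexx index.

```python
from bisect import bisect_right

INDEX_INTERVAL = 128  # 1 out of every 128 terms is kept in memory


class SparseLexicon:
    """Toy model of the sparse-index strategy (hypothetical, for illustration)."""

    def __init__(self, terms, interval=INDEX_INTERVAL):
        self.terms = sorted(set(terms))        # stand-in for the .lex file on disk
        self.interval = interval
        self.index = self.terms[::interval]    # in-memory sample, like .lexx
        self.terms_scanned = 0                 # proxy for disk work done

    def find(self, term):
        """Return the term's position in the lexicon, or None if absent."""
        # The index entry at or before the term gives the general location.
        slot = bisect_right(self.index, term) - 1
        if slot < 0:
            return None  # term sorts before everything in the lexicon
        start = slot * self.interval
        stop = min(start + self.interval, len(self.terms))
        # Scan at most one interval's worth of terms for the exact match.
        for pos in range(start, stop):
            self.terms_scanned += 1
            if self.terms[pos] == term:
                return pos
            if self.terms[pos] > term:
                break
        return None


def load_full_lexicon(terms):
    """Full-caching strategy: every unique term in memory, no scanning."""
    return {term: pos for pos, term in enumerate(sorted(set(terms)))}


dates = [f"{n:06d}" for n in range(1000)]  # placeholder 'date' values
sparse = SparseLexicon(dates)
full = load_full_lexicon(dates)
print(sparse.find("000300"), full.get("000300"), len(sparse.index), len(full))
```

The sparse version holds only 8 of the 1000 terms in memory but has to scan for each lookup. The full version holds all 1000 and answers directly, so its footprint grows with the number of unique values.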

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/




