[KinoSearch] MultiSearcher's lack of features
Marvin Humphrey
marvin at rectangular.com
Tue Jun 12 07:45:57 PDT 2007
On Jun 12, 2007, at 2:28 AM, Henka wrote:
>> Looking up those values is kind of expensive though. Lexicons are
>> only partially kept in memory (1 out of every 128 terms), and with a
>> multi-segment index, you have to perform 1 disk scan per segment.
>> Say you have 10 hits and 25 segments. That's 250 disk seeks to
>> associate each hit with a sort field value. :(
>
> Very bad indeed. That would potentially murder search times on a busy
> search cluster, right? (IO being the bottleneck)
I think I've overstated the problem.
First... if we go to full caching of sort fields, it's a memory hit,
but not an unmanageable one. The full caching strategy is, in fact,
EXACTLY how Lucene does things. I'd hoped to improve on it, but
maybe that won't be possible this time.
Second... best practice for a busy search cluster would be to
optimize the indexes on most nodes, so that each index contains a
single segment. Then you're looking at 1 disk seek per hit, and
they're all coming at the same time and are all concentrated in the
same spot in the same file. OS caching will help some, and the fact
that the disk head wouldn't have to travel far will also make those
seeks comparatively inexpensive. Search performance will continue to
be dominated by the time it takes to score large numbers of
matches for common terms. These lookups won't be a primary
consideration.
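As a rough sketch of the arithmetic, assuming one lexicon lookup per hit per segment as in the estimate quoted above (the function name is hypothetical and not part of KinoSearch):

```python
def worst_case_seeks(hits, segments):
    """Upper bound on disk seeks needed to attach a sort value to each hit.

    Assumes one lexicon lookup per hit in every segment, which is the
    pessimistic estimate from the quoted message, and ignores OS caching.
    """
    return hits * segments


# 10 hits against a 25-segment index versus an optimized single-segment index.
print(worst_case_seeks(10, 25))
print(worst_case_seeks(10, 1))
```

With 10 hits, the 25-segment index gives the 250 seeks mentioned earlier, while the optimized index needs at most 10.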
For busy search clusters that must keep indexes updated frequently,
you should dedicate one machine to rapidly changing data, while all
the rest handle older, stable data and stay optimized. The rapid-
update index will necessarily be multi-segment, but if you keep it
small, the costs should be manageable.
I intend to write KinoSearch::Docs::CookBook::ScalingUp describing
this architecture. (: But not this week. :) Best practice will
remain the same no matter what system we adopt to handle the sort
caching issue.
>> To avoid that cost, we might have to load entire lexicons for sort
>> fields into memory. I've been trying to avoid that, but I don't see
>> how.
>
> Just so I understand: when you say "load entire lexicons for sort
> fields into memory" you mean the sort fields of the -search result
> set-, right?
Say you are sorting by 'date'.
At present, we keep 1 out of every 128 values for the date field in
memory -- the contents of the .lexx file (LEXicon indeX). When we
need to find a particular value, we look it up in this index, which
tells us the general location on disk. Then we scan a small portion
of the full .lex file to find the exact term.
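The two-step lookup can be sketched as follows. This is only an illustration of the idea, not KinoSearch's actual code: the in-memory list stands in for the .lexx file, the sorted list stands in for the .lex file on disk, and all function names are hypothetical.

```python
import bisect

INDEX_INTERVAL = 128  # one in-memory entry per 128 terms, as described above


def build_sparse_index(sorted_terms, interval=INDEX_INTERVAL):
    """Keep every interval-th term plus its position (stand-in for .lexx)."""
    positions = list(range(0, len(sorted_terms), interval))
    return [sorted_terms[p] for p in positions], positions


def lookup(term, sorted_terms, index_terms, index_positions,
           interval=INDEX_INTERVAL):
    """Return the term's position in the full lexicon, or None if absent."""
    # Step 1: binary search the sparse index for the last entry <= term.
    slot = bisect.bisect_right(index_terms, term) - 1
    if slot < 0:
        return None  # term sorts before everything in the lexicon
    start = index_positions[slot]
    # Step 2: scan at most one interval of the full lexicon. In a real
    # index this is the part that touches the disk.
    for pos in range(start, min(start + interval, len(sorted_terms))):
        if sorted_terms[pos] == term:
            return pos
        if sorted_terms[pos] > term:
            break
    return None


# Example with 1000 made-up, zero-padded 'date' values.
terms = ["%08d" % n for n in range(20070101, 20070101 + 1000)]
index_terms, index_positions = build_sparse_index(terms)
print(len(index_terms), lookup("20070601", terms, index_terms, index_positions))
```

Only the sparse list has to stay resident; each lookup then costs one short scan of the full lexicon.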
What we might do instead is load ALL 'date' values into memory (the
full .lex file) -- then we wouldn't need to touch the disk again.
The memory costs of doing this depend on how many unique dates you have.
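To get a feel for the difference, here is a small sketch comparing how many entries each strategy keeps in memory. It counts entries only, since the bytes per entry depend on the field and on implementation details not covered here; the function name is hypothetical.

```python
import math


def entries_in_memory(unique_values, interval=128):
    """Return (sparse_entries, full_entries) for a single sort field.

    sparse_entries: 1 out of every `interval` values, as with the .lexx file.
    full_entries:   every unique value, as with full caching.
    """
    sparse = math.ceil(unique_values / interval)
    return sparse, unique_values


# One million unique 'date' values is an assumed figure for illustration.
print(entries_in_memory(1_000_000))
# Day-granularity dates spanning roughly ten years.
print(entries_in_memory(3650))
```

A field with few unique values, such as day-granularity dates, costs little to cache in full; a field with second-granularity timestamps could have nearly one unique value per document.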
Marvin Humphrey
Rectangular Research
http://www.rectangular.com/