[KinoSearch] MultiSearcher's lack of features
Henka
henka at cityweb.co.za
Tue Jun 12 08:36:07 PDT 2007
> First... if we go to full caching of sort fields, it's a memory hit,
> but not an unmanageable one. The full caching strategy is, in fact,
> EXACTLY how Lucene does things. I'd hoped to improve on it, but
> maybe that won't be possible this time.
Understood. I think you've done a bloody fine job anyway!
> Second... best practice for a busy search cluster would be to
> optimize the indexes on most nodes, so that each index contains a
> single segment. Then you're looking at 1 disk seek per hit, and
> they're all coming at the same time and are all concentrated in the
> same spot in the same file. OS caching will help some, and the fact
> that the disk head wouldn't have to travel far will also make those
> seeks comparatively inexpensive. Search performance will continue to
> be dominated by the time it takes to score large numbers of
> matches for common terms. These lookups won't be a primary
> consideration.
OK - I was intending to always optimize the final indexes on each cluster
node.
> For busy search clusters that must keep indexes updated frequently,
> you should dedicate one machine to rapidly changing data, while all
> the rest handle older, stable data and stay optimized. The rapid-
> update index will necessarily be multi-segment, but if you keep it
> small, the costs should be manageable.
For the time being, I was thinking of a nice simple approach: the search
node indexes are always optimized and are 'refreshed' (i.e., overwritten -
quickly, during the graveyard shift) whenever the indexing cycle dictates
(once a week, month, whatever) - indexing and merging occurring outside the
search nodes on indexing cluster nodes.
However, I like your idea of having (a) separate search node(s) with data
which is in flux and un-optimized... hell, this is turning out to be
oodles of fun! <rubbing hands>
>>> To avoid that cost, we might have to load entire lexicons for sort
>>> fields into memory. I've been trying to avoid that, but I don't see
>>> how.
>>
>> Just so I understand: when you say "load entire lexicons for sort
>> fields into memory" you mean the sort fields of the -search result
>> set-, right?
>
> Say you are sorting by 'date'.
>
> At present, we keep 1 out of every 128 values for the date field in
> memory -- the contents of the .lexx file (LEXicon indeX). When we
> need to find a particular value, we look it up in this index, which
> tells us the general location on disk. Then we scan a small portion
> of the full .lex file to find the exact term.
>
> What we might do instead is load ALL 'date' values into memory (the
> full .lex file) -- then we wouldn't need to touch the disk again.
> The memory costs of doing this depend on how many unique dates you have.
So far, about 30 million. Even quadruple that, sorting by float - spread
across multiple beefy nodes, things are fine. This means my initial
knee-jerk was unfounded.
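
For what it's worth, here is the back-of-envelope arithmetic behind that, as a
rough Python sketch. The 8 bytes per value (a double-precision float) and the
4 nodes are my own assumptions for illustration, not figures from KinoSearch;
a real cache would also carry per-entry overhead that this ignores.

```python
def sort_cache_bytes_per_node(unique_values: int, bytes_per_value: int, nodes: int) -> int:
    """Rough per-node size of a fully cached sort field.

    Assumes values are split evenly across nodes and ignores any
    per-entry overhead (hypothetical simplification).
    """
    if nodes <= 0:
        raise ValueError("nodes must be positive")
    # Ceiling division, so a remainder rounds up to the busiest node.
    values_per_node = -(-unique_values // nodes)
    return values_per_node * bytes_per_value


current_values = 30_000_000            # "about 30 million" unique values today
projected_values = 4 * current_values  # "even quadruple that"

# Assumed: 8-byte floats, 4 search nodes.
per_node = sort_cache_bytes_per_node(projected_values, bytes_per_value=8, nodes=4)
print(f"{per_node / 2**20:.0f} MiB per node")
```

Under those assumptions the quadrupled data set costs each node about as much
as today's 30 million values would cost a single machine.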
Thanks for the detailed explanation.
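
To check my understanding of the two strategies, here is a minimal Python
sketch of how I picture them. It is not KinoSearch code: the class names are
made up, a plain list stands in for the on-disk .lex file, and the only detail
taken from your description is the 1-in-128 sampling of the .lexx index.

```python
import bisect

INDEX_INTERVAL = 128  # from the thread: 1 out of every 128 values kept in memory


class SparseLexicon:
    """Sampled in-memory index plus a short scan (the .lexx/.lex idea)."""

    def __init__(self, sorted_terms, interval=INDEX_INTERVAL):
        self._terms = list(sorted_terms)        # stand-in for the .lex file on disk
        self._interval = interval
        self._index = self._terms[::interval]   # stand-in for the .lexx contents
        self.scanned = 0                        # entries read from "disk" so far

    def find(self, term):
        # Binary-search the sampled index for the last sampled term <= target.
        slot = bisect.bisect_right(self._index, term) - 1
        if slot < 0:
            return None  # target sorts before every term in the lexicon
        start = slot * self._interval
        end = min(start + self._interval, len(self._terms))
        # Scan at most one interval of the full lexicon for the exact term.
        for pos in range(start, end):
            self.scanned += 1
            if self._terms[pos] == term:
                return pos
            if self._terms[pos] > term:
                break
        return None


class FullLexicon:
    """Every value held in memory, so lookups never touch the disk."""

    def __init__(self, sorted_terms):
        self._positions = {term: pos for pos, term in enumerate(sorted_terms)}

    def find(self, term):
        return self._positions.get(term)


terms = [f"2007-{i:05d}" for i in range(1000)]  # made-up, already sorted values
sparse = SparseLexicon(terms)
full = FullLexicon(terms)
```

Both `find` methods return the term's position in the sorted lexicon; the
sparse one pays for a scan of up to 128 entries per lookup, the full one pays
in memory up front.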
Henry