[KinoSearch] MultiSearcher's lack of features

Marvin Humphrey marvin at rectangular.com
Mon Jun 11 16:56:14 PDT 2007


On Jun 11, 2007, at 2:04 PM, Hans Dieter Pearcey wrote:

> I looked through the MultiSearcher source and saw the dire warnings
> about "the sort cache problem", but I don't know what that means  
> and the
> missing support for filters doesn't seem to be described at all.

Technically, MultiSearcher can handle Filters... it's SearchClient  
that can't.  The problem is the overhead of passing the cached  
BitVector in a Filter between client and server.  There's no fix  
pending, I'm afraid.

The "sort cache problem" is really a problem of collating search  
results from disparate searchers.  For this one, there is a fix  
available in theory, but it's costly in terms of memory.

In KS, the sort cache is an array of 32-bit integers, one per  
document, corresponding to the "term number" in the lexicon for the  
sort field.

If this is our doc set...

   my @docs = (
     { name => 'fido',     species => 'dog' }, # doc 0
     { name => 'spot',     species => 'dog' }, # doc 1
     { name => 'whiskers', species => 'cat' }, # doc 2
   );

... then the lexicon for the field 'species' will look like this...

   my @lexicon = (
      cat, # term 0
      dog, # term 1
   );

... and this will be our sort cache:

   my @sort_cache = ( 1, 1, 0 );

It's dirt cheap to sort docs by term because of the sort cache: just  
look up the term numbers for the two documents and compare.

    if ( $sort_cache[$doc_num_a] < $sort_cache[$doc_num_b] ) {
       ...
    }

The problem is that the central machine doesn't have access to the  
sort caches on the satellites, so when it comes time to select the  
creme de la creme from 5 sets of satellite top 10 hits, the central  
machine doesn't know how docs from different satellites compare.

The straightforward solution is to have the satellites send not just  
doc numbers to the central machine, but doc numbers accompanied by  
sort field values.

   # now:
   (  451, 20, 52 )
   # fixed:
   ( [ 451, 'cat' ], [ 20, 'dog' ], [ 52, 'human' ] )

Then the central machine could make string comparisons of the sort  
field values from different machines.

Looking up those values is kind of expensive though.  Lexicons are  
only partially kept in memory (1 out of every 128 terms), and with a  
multi-segment index, you have to perform 1 disk scan per segment.   
Say you have 10 hits and 25 segments.  That's 250 disk seeks to  
associate each hit with a sort field value.  :(

To avoid that cost, we might have to load entire lexicons for sort  
fields into memory.  I've been trying to avoid that, but I don't see  
how.

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/





More information about the kinosearch mailing list