[KinoSearch] Scaling KinoSearch

Marvin Humphrey marvin at rectangular.com
Sun May 18 23:06:04 PDT 2008


On May 17, 2008, at 11:19 PM, Nathan Kurz wrote:

> My first conclusion was that it will be necessary to split the index
> between multiple machines.  Even if it was possible to put 1 TB of RAM
> in a machine, at a modern processors ~5-10 GB/s for main memory access
> it would still take far too long to scan through the posting list for
> a common word.

I don't know if this would help with the particular app you have in  
mind, but in certain cases there may be heuristics which can improve  
search-time efficiency at the cost of some accuracy.  For instance, if  
you have an absolute measure by which you can pre-rank documents --  
e.g. a "pagerank" score -- then you don't necessarily need to scan  
through the entire posting list at search time.  See the docs for  
Schema::pre_sort and Searcher's set_prune_factor() method.

> Disk access, at around 100 MB/s, is almost too slow
> to consider, although if the tail of infrequent words is long enough
> it might be possible to save RAM by keeping the smallest posting lists
> on disk.

But since the virtual memory system is going to cache that stuff  
anyhow, wouldn't you prefer to leave memory management to the OS?

> The easiest way to split the index is to split it horizontally into
> what Google calls 'shards':  the first million documents go to the
> first machine, the second million to the next machine, and so forth.
> Each machine returns the best matches from its subset of documents,
> and a central machine collates this list. Splitting the other way (by
> words) might have some small performance benefit, but it also seems
> much harder to implement.

It's probably also best to divide up machines by tasks.  The machines  
which are responsible for scanning through posting lists should  
probably not also be responsible for retrieving stored docs and  
applying highlighting.  The search machines should merely return  
document identifiers to the boss node(s), which will then request docs  
from the storage cluster.

The code for the worker nodes in the scoring cluster might look  
something like this:

     package MySegReader;
     use base qw( KinoSearch::Index::SegReader );

     sub make_doc_reader {
         my $self = shift;
         return PrimaryKeyOnlyDocReader->new(
             invindex => $self->get_invindex
         );
     }

     package MyIndexReader;
     use base qw( KinoSearch::Index::IndexReader );

     sub make_seg_reader {
         my $self = shift;
         return MySegReader->new( invindex => $self->get_invindex );
     }

     ...

     my $reader = MyIndexReader->open(
         invindex => MySchema->read('/path/to/invindex'),
     );
     my $searcher = KinoSearch::Searcher->new( reader => $reader );
     my $search_server = KinoSearch::Search::SearchServer->new(
         searchable => $searcher,
         port       => 7890,
         password   => 'opensesame',
     );
     $search_server->serve;

The boss node code might look something like this:

     package MyMultiSearcher;
     use base qw( KinoSearch::Search::MultiSearcher );

     our %doc_server;

     sub new {
         my $either = shift;
         my $self = $either->SUPER::new(@_);
         $doc_server{$$self} = DocServerCluster->new;
         return $self;
     }

     sub fetch_doc {
         my ( $self, $doc_num ) = @_;
         my $key = $self->SUPER::fetch_doc($doc_num)->{primary_key};
         return $doc_server{$$self}->fetch_doc($key);
     }

     ...

     my $schema = MySchema->new;
     for my $server_name (@server_names) {
         push @searchers, KinoSearch::Search::SearchClient->new(
             peer_address => "$server_name:$port",
             password     => $pass,
             schema       => $schema,
         );
     }
     my $multi_searcher = KinoSearch::Search::MultiSearcher->new(
         searchables => \@searchers,
         schema      => $schema,
     );

     ...

It's possible to improve that (there's an extra round trip per doc  
fetch ), but you get the idea.

> And the shard approach makes adding new
> documents to the index very easy.

This is where I think KS might be able to use the index-time  
equivalent of Searchable.  (Indexable? Nah. IndexBuilder?).   
InvIndexer would be the single-machine subclass.  MultiInvIndexer  
would be the analogue to MultiSearcher, farming out doc additions to  
various worker nodes...

Or maybe it's best to just let solutions arise on their own.   
Distributed indexing is easier than distributed searching.

> To keep the latency down, this may require hosting multiple smaller
> shards per machine.

That could work.  How about one shard per processor?  I think that  
might actually be a good way to exploit multiple processors -- better  
than ithreads (which KS doesn't do anyhow).

> Given multiple sockets and NUMA, one might even get a bigger win here
> (doubling memory bandwidth) by deft use of processor affinity. I don't
> understand this well, but here's an intro:
> [www.kernel.org/pub/linux/kernel/people/christoph/pmig/ 
> numamemory.pdf].

On first scan, it makes basic sense to me as another variation on the  
theme of locality of reference.

> if anyone sees obvious flaws in my reasoning
> I'd be glad to hear them.

There are some Searchable APIs that don't work with multiple machines  
as presently implemented.

First, QueryFilter objects can't be passed between nodes because  
they're implemented as cached BitVectors, which are too big to send  
over the network with each search.  To address this, you'd have to  
cook up special SearchServer subclasses that kept their own  
QueryFilter objects around and knew how to interpret special requests  
from the SearchClient.

Second, KinoSearch's present multi-machine sorting implementation  
isn't very well-done IMO.

Last, Searchable's collect() API, soon to go public...

   $searcher->collect( query => $query, collector => $hit_collector );

... is single-machine as well, because the HitCollector would have to  
be serialized and sent to each node.  Working around this limitation  
has to be done case-by-case, as it has with TopDocCollector and  
Searchable's top_docs() method.

None of these constitutes a fundamental flaw in your model.  The  
HitCollector limitation is permanent and the other two are just  
problems to be worked out.  With luck we can solve 'em as elegantly as  
we've solved QueryParser.

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/




More information about the kinosearch mailing list