[KinoSearch] Scaling KinoSearch
Marvin Humphrey
marvin at rectangular.com
Sun May 18 23:06:04 PDT 2008
On May 17, 2008, at 11:19 PM, Nathan Kurz wrote:
> My first conclusion was that it will be necessary to split the index
> between multiple machines. Even if it was possible to put 1 TB of RAM
> in a machine, at a modern processors ~5-10 GB/s for main memory access
> it would still take far too long to scan through the posting list for
> a common word.
I don't know if this would help with the particular app you have in
mind, but in certain cases there may be heuristics which can improve
search-time efficiency at the cost of some accuracy. For instance, if
you have an absolute measure by which you can pre-rank documents --
e.g. a "pagerank" score -- then you don't necessarily need to scan
through the entire posting list at search time. See the docs for
Schema::pre_sort and Searcher's set_prune_factor() method.
> Disk access, at around 100 MB/s, is almost too slow
> to consider, although if the tail of infrequent words is long enough
> it might be possible to save RAM by keeping the smallest posting lists
> on disk.
But since the virtual memory system is going to cache that stuff
anyhow, wouldn't you prefer to leave memory management to the OS?
> The easiest way to split the index is to split it horizontally into
> what Google calls 'shards': the first million documents go to the
> first machine, the second million to the next machine, and so forth.
> Each machine returns the best matches from its subset of documents,
> and a central machine collates this list. Splitting the other way (by
> words) might have some small performance benefit, but it also seems
> much harder to implement.
It's probably also best to divide up machines by tasks. The machines
which are responsible for scanning through posting lists should
probably not also be responsible for retrieving stored docs and
applying highlighting. The search machines should merely return
document identifiers to the boss node(s), which will then request docs
from the storage cluster.
The code for the worker nodes in the scoring cluster might look
something like this:
package MySegReader;
use base qw( KinoSearch::Index::SegReader );
sub make_doc_reader {
my $self = shift;
return PrimaryKeyOnlyDocReader->new(
invindex => $self->get_invindex
);
}
package MyIndexReader;
use base qw( KinoSearch::Index::IndexReader );
sub make_seg_reader {
my $self = shift;
return MySegReader->new( invindex => $self->get_invindex );
}
...
my $reader = MyIndexReader->open(
invindex => MySchema->read('/path/to/invindex'),
);
my $searcher = KinoSearch::Searcher->new( reader => $reader );
my $search_server = KinoSearch::Search::SearchServer->new(
searchable => $searcher,
port => 7890,
password => 'opensesame',
);
$search_server->serve;
The boss node code might look something like this:
package MyMultiSearcher;
use base qw( KinoSearch::Search::MultiSearcher );
our %doc_server;
sub new {
my $either = shift;
my $self = $either->SUPER::new(@_);
$doc_server{$$self} = DocServerCluster->new;
return $self;
}
sub fetch_doc {
my ( $self, $doc_num ) = @_;
my $key = $self->SUPER::fetch_doc($doc_num)->{primary_key};
return $doc_server{$$self}->fetch_doc($key);
}
...
my $schema = MySchema->new;
for my $server_name (@server_names) {
push @searchers, KinoSearch::Search::SearchClient->new(
peer_address => "$server_name:$port",
password => $pass,
schema => $schema,
);
}
my $multi_searcher = KinoSearch::Search::MultiSearcher->new(
searchables => \@searchers,
schema => $schema,
);
...
It's possible to improve that (there's an extra round trip per doc
fetch ), but you get the idea.
> And the shard approach makes adding new
> documents to the index very easy.
This is where I think KS might be able to use the index-time
equivalent of Searchable. (Indexable? Nah. IndexBuilder?).
InvIndexer would be the single-machine subclass. MultiInvIndexer
would be the analogue to MultiSearcher, farming out doc additions to
various worker nodes...
Or maybe it's best to just let solutions arise on their own.
Distributed indexing is easier than distributed searching.
> To keep the latency down, this may require hosting multiple smaller
> shards per machine.
That could work. How about one shard per processor? I think that
might actually be a good way to exploit multiple processors -- better
than ithreads (which KS doesn't do anyhow).
> Given multiple sockets and NUMA, one might even get a bigger win here
> (doubling memory bandwidth) by deft use of processor affinity. I don't
> understand this well, but here's an intro:
> [www.kernel.org/pub/linux/kernel/people/christoph/pmig/
> numamemory.pdf].
On first scan, it makes basic sense to me as another variation on the
theme of locality of reference.
> if anyone sees obvious flaws in my reasoning
> I'd be glad to hear them.
There are some Searchable APIs that don't work with multiple machines
as presently implemented.
First, QueryFilter objects can't be passed between nodes because
they're implemented as cached BitVectors, which are too big to send
over the network with each search. To address this, you'd have to
cook up special SearchServer subclasses that kept their own
QueryFilter objects around and knew how to interpret special requests
from the SearchClient.
Second, KinoSearch's present multi-machine sorting implementation
isn't very well-done IMO.
Last, Searchable's collect() API, soon to go public...
$searcher->collect( query => $query, collector => $hit_collector );
... is single-machine as well, because the HitCollector would have to
be serialized and sent to each node. Working around this limitation
has to be done case-by-case, as it has with TopDocCollector and
Searchable's top_docs() method.
None of these constitutes a fundamental flaw in your model. The
HitCollector limitation is permanent and the other two are just
problems to be worked out. With luck we can solve 'em as elegantly as
we've solved QueryParser.
Marvin Humphrey
Rectangular Research
http://www.rectangular.com/
More information about the kinosearch
mailing list