ScalingUp

As you add more and more documents to an index, at some point it grows large enough that one machine can't handle it. The solution is to distribute the document collection over multiple machines.

Let's imagine that our machines are really, really weak, and that we need three of them to handle the 52 HTML documents in the US Constitution sample corpus. Add the following lines to conf.pl:

num_nodes => 3, # number of machines in our search cluster
node_num  => 0, # assign this machine node number 0

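At index time, we need to distribute the docs more or less evenly among the nodes. Assuming that the indexing processes are taking place on separate machines, each needs to be able to tell which docs belong to it without consulting the others. We can achieve this with a hashing algorithm in invindexer.plx: hash each doc's URL, take the result modulo num_nodes, and index the doc only if the answer matches this machine's node_num.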

use String::CRC32 qw( crc32 );

...

for my $filename (@filenames) {
    my $doc = slurp_and_parse_file($filename);
    my $index_assignment = crc32( $doc->{url} ) % $conf->{num_nodes};
    next unless $index_assignment == $conf->{node_num};
    $invindexer->add_doc($doc);
}
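To see how the modulo assignment spreads documents around, here is a minimal standalone sketch. It makes two assumptions: the URLs are invented placeholders rather than the real corpus URLs, and crc32_pp is a pure-Perl stand-in for String::CRC32's crc32 so the script runs without any CPAN modules. The split is only roughly even, and the exact counts depend on the URLs.

```perl
use strict;
use warnings;

# Pure-Perl CRC-32 (reflected polynomial 0xEDB88320). A stand-in for
# String::CRC32::crc32, used here only to keep the sketch dependency-free.
sub crc32_pp {
    my ($data) = @_;
    my $crc = 0xFFFFFFFF;
    for my $byte ( unpack 'C*', $data ) {
        $crc ^= $byte;
        for ( 1 .. 8 ) {
            $crc = ( $crc & 1 ) ? ( $crc >> 1 ) ^ 0xEDB88320 : $crc >> 1;
        }
    }
    return $crc ^ 0xFFFFFFFF;
}

# Same rule as the indexing loop: hash the URL, then take it modulo
# the number of nodes.
sub node_for {
    my ( $url, $num_nodes ) = @_;
    return crc32_pp($url) % $num_nodes;
}

# Hypothetical URLs; the real ones come from the parsed documents.
my @urls      = map { "/us_constitution/doc$_.html" } 1 .. 52;
my $num_nodes = 3;

# Count how many docs land on each node.
my @count = (0) x $num_nodes;
$count[ node_for( $_, $num_nodes ) ]++ for @urls;
printf "node %d: %d docs\n", $_, $count[$_] for 0 .. $num_nodes - 1;
```

Because the assignment depends only on the URL and num_nodes, every machine computes the same answer independently. The flip side is that changing num_nodes reassigns most docs, so all of the indexes would need to be rebuilt.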

If we run invindexer.plx three times, each time with a different node_num, every doc will be indexed exactly once. We don't actually want to set up a cluster just for the purposes of a tutorial, so let's fake three machines by changing node_num and index_path for each run, creating three numbered indexes.

# repeat with 0, 1, and 2.
node_num   => 0,
index_path => '/path/to/index0',

At search time, we'll use MultiSearcher to aggregate results. Create three sub-searchers and wrap a MultiSearcher around them:

my @subsearchers;
# Nodes are numbered 0 .. num_nodes - 1.
for my $node_num ( 0 .. $conf->{num_nodes} - 1 ) {
    push @subsearchers, KinoSearch::Searcher->new(
        invindex => USConSchema->read("/path/to/index$node_num"),
    );
}
my $searcher = KinoSearch::Search::MultiSearcher->new(
    schema      => USConSchema->new,
    searchables => \@subsearchers,
);
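Conceptually, aggregating results means collecting each node's ranked hits and keeping the best ones overall. The sketch below illustrates that idea in plain Perl. It is not MultiSearcher's actual implementation: merge_hits, the hit structure, and the sample scores are all hypothetical, and it assumes scores from different nodes are directly comparable.

```perl
use strict;
use warnings;

# Merge per-node hit lists into one list ordered by descending score,
# keeping at most $num_wanted hits. Ties are broken by URL so the
# ordering is stable from run to run.
sub merge_hits {
    my ( $num_wanted, @per_node ) = @_;
    my @all = sort { $b->{score} <=> $a->{score} || $a->{url} cmp $b->{url} }
        map { @$_ } @per_node;
    $#all = $num_wanted - 1 if @all > $num_wanted;
    return \@all;
}

# Hypothetical results from three nodes for the same query.
my @node_hits = (
    [ { url => 'art1.html',   score => 0.91 }, { url => 'amend4.html', score => 0.40 } ],
    [ { url => 'amend1.html', score => 0.75 } ],
    [ { url => 'art3.html',   score => 0.62 }, { url => 'amend9.html', score => 0.10 } ],
);

my $top = merge_hits( 3, @node_hits );
print "$_->{url} ($_->{score})\n" for @$top;
```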

search.cgi should now function nearly identically to the way it did before.

SearchServer and SearchClient

In a real search cluster, you'll create one master node and several worker nodes. The worker nodes will each run SearchServer, and the master node will run a MultiSearcher which aggregates the results from several SearchClients.

For demo purposes, we'll set up a single SearchServer locally and replace our first subsearcher with a SearchClient.