[KinoSearch] Scaling KinoSearch

Nathan Kurz nate at verse.com
Mon May 26 13:00:39 PDT 2008


On Wed, May 21, 2008 at 3:35 PM, Nathan Kurz <nate at verse.com> wrote:
> Rather than using the
> same KinoSearch architecture for both the Boss and the Worker, I would
> have them interact via a limited and interface (probably via HTTP,
> although possibly FCGI or memcached protocol).
>
> The Boss would receive a text query, and pass it on --- as text --- to
> a pool of Searchers.  The Searchers, using the current KinoSearch
> implementation wrapped by a thin persistence layer, would perform the
> search and return a list of Documents and Scores.  The Boss would
> select the highest results from the responses (skipping Workers that
> take too long), and (if desired) make a second set of requests to
> another pool to get highlighted excerpts.
>
> The Boss would be completely ignorant of how the Searchers search ---
> for all it knows, they could be running Lucene, making a SQL query, or
> even be proxying to Google. The Searchers would be completely
> independent of one another --- even document numbers would not need to
> be presumed unique.
>
>  Are there disadvantages to this approach?  I like its transparency,
> and that it doesn't add complexity to the KinoSearch core.

I've thought more about this approach, and I think it is sound.  It
should be pretty easy to implement using existing tools, and it should
scale very well.  One nice thing is that it can be in addition to the
internal OO search distribution that KinoSearch already does.

I haven't actually done it, but I think it could be implemented as a
plugin to Perlbal: <http://www.danga.com/perlbal/>.   Perlbal is a
single-threaded reverse proxy, normally used to load balance between
pool of backend servers.  But it seems pretty straightforward to
modify it to make requests  from all of the servers in a pool, and
aggregate the responses.  I asked about this on the Perlbal list, and
got a pleasant positive response:
http://lists.danga.com/pipermail/perlbal/2008-May/000969.html

Perlbal would maintain a persistent connection to each of the shards,
whether by HTTP, FCGI, or something custom.  These shards would be
grouped into pools, such that if one pool is busy another pool is
used.  If all pools are busy, the request is held until a pool frees
up.   If a request to a shard takes longer than some threshold time,
it is canceled and the responses from the on-time shards are used.

Perlbal has good a good management interface, and in the future it
should be possible to design a layer on top of this that could respawn
shards as necessary.  Or if using a cloud computing approach like
Amazon EC2, it should be possible to actively add and remove pools in
response to demand.  I think this can be added at some later date,
though, and does not require immediate thought.

One possible trickiness is that Perlbal's single-threaded event-driven
approach doesn't leave much room for processing the responses.  But
finding the top responses from a few dozen shards shouldn't be much of
a problem.  I've thought some baout efficient algorithms to do this,
but haven't tried any yet.  And the upside of of the single-threaded
approach is that connection caching can be very efficient.

I won't be able to actually get this done for a while, but that
shouldn't stop anyone else from doing it.  One of the nice parts about
this approach is that very little needs to be done to the KinoSearch
core to make this possible --- other than a thin wrapper with the
appropriate interface, it's all already there.  And while I haven't
thought about it as much, but I think the Doc servers could use the
same approach for getting highlighted excerpts.

Nathan Kurz
nate at verse.com



More information about the KinoSearch mailing list