[KinoSearch] KinoSearch::Docs::Cookbook::ReusingSearchers

Nathan Kurz nate at verse.com
Fri Sep 14 14:16:28 PDT 2007



On 9/14/07, Henry <henka at cityweb.co.za> wrote:
> > What we need to do is break up nodes by task.  We'll have a
> > dedicated lexicon server, a dedicated document server, one or more
> > master search nodes, and multiple machines dedicated to the task of
> > crunching through posting lists.  The lexicon server and the document
> > server will be abstractions behind which one or more machines will sit.
>
> OK - this makes things a bit more involved from an admin perspective - and
> also requires a lot more hardware (if I understand your proposed
> architecture correctly).
>
> Correct me if I'm wrong:  instead of chopping up the index into smaller
> and smaller sub-indexes (spread across multiple 'master' search nodes) to
> improve performance and handle concurrent search load, you're proposing
> separate physical machines with specialized roles/tasks (lexicon, docs,
> search)?

His proposal doesn't require separate physical machines;
everything is virtual.  And you still split the index across machines
for performance, just as you are doing.  The difference is that at least
one machine has a full copy of the portion of the index that allows it
to do the sorting.

Instead of having to ask the search node for this information, it can
look it up locally, and because this is all it does it could keep this
information in memory.  This would likely be the only physical machine
you would be adding.
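
To make the idea above concrete, here is a minimal sketch in Python (not
KinoSearch code).  It assumes doc numbers are unique across the whole
farm and that the master holds a hypothetical in-memory table,
sort_values, mapping each doc number to its sort-field value.  The
names merge_and_sort, shard_hits and sort_values are made up for
illustration.

```python
import heapq

def merge_and_sort(shard_hits, sort_values, limit):
    """Merge hits from several search nodes and order them locally.

    shard_hits  -- one list of (doc_num, score) pairs per search node
    sort_values -- hypothetical in-memory table: doc_num -> sort-field value
    limit       -- number of hits to return
    """
    # Flatten the per-node result lists into a single list of hits.
    merged = [hit for hits in shard_hits for hit in hits]
    # Order by the locally cached field value; break ties by higher score.
    # No round trip to the search nodes is needed for the field values.
    return heapq.nsmallest(
        limit, merged, key=lambda hit: (sort_values[hit[0]], -hit[1])
    )
```

The search nodes only ever ship doc numbers and scores; everything
needed for the sort already sits in memory on the merging machine.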

> By "multiple machines dedicated to the task of crunching through posting
> lists" I presume you mean the indexing machines?

I think he's referring to the core of the search process here, rather
than indexing: "chewing through raw posting lists and doing nothing
but spitting out document numbers and scores".  The indexing would be
separate.
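
As a rough illustration of what such a node would do, here is a toy
Python sketch, not KinoSearch's actual scoring.  It assumes a
hypothetical postings table mapping each term to (doc_num, weight)
pairs and simply sums the weights per document; the function name
crunch_postings is made up.

```python
import heapq
from collections import defaultdict

def crunch_postings(postings, query_terms, limit):
    """Return the top `limit` (doc_num, score) pairs for a query.

    postings    -- hypothetical table: term -> list of (doc_num, weight)
    query_terms -- terms to look up
    limit       -- number of hits to return
    """
    scores = defaultdict(float)
    for term in query_terms:
        # Walk the raw posting list and accumulate a score per document.
        for doc_num, weight in postings.get(term, []):
            scores[doc_num] += weight
    # Emit nothing but doc numbers and scores, best first.
    return heapq.nlargest(limit, scores.items(), key=lambda item: item[1])
```

Note that the node never touches stored field values; fetching those
would be a separate job.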

Marvin's right that this approach could be more efficient.  I'm
worried that the complexity added by requiring one omniscient machine
is large, and think that coming up with an efficient way to return
field values would be of more general use.  He may be right, though.

> There are ~31M docs (growing hourly), with a total size of several
> hundred gigabytes and growing.
> ...
> My primary concern is customer-perceived search times.  Ideally, it should
> be sub-second, no matter what.

I haven't played with large data sets like this yet.  How long does a
straight search take if you were to run it on a single machine with
the full dataset?  Conversely, how large can you make the partitioned
index in the search farm while keeping response time where you want
it?  I'm interested in what the overhead of the MultiSearcher approach
actually is.
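
One way such a comparison could be run is sketched below in Python.
The two callables, single_fn and multi_fn, are placeholders for
whatever runs a query against the single-machine index and against
the search farm; they are assumptions, not KinoSearch APIs.

```python
import time

def best_latency(search_fn, query, repeats=5):
    """Return the fastest wall-clock time in seconds over several runs."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        search_fn(query)
        best = min(best, time.perf_counter() - start)
    return best

def compare(single_fn, multi_fn, queries, repeats=5):
    """Time each query both ways.

    Returns one (query, single_secs, multi_secs, difference) row per query,
    where difference is multi_secs minus single_secs.
    """
    rows = []
    for query in queries:
        single = best_latency(single_fn, query, repeats)
        multi = best_latency(multi_fn, query, repeats)
        rows.append((query, single, multi, multi - single))
    return rows
```

Taking the best of several runs reduces noise from cold caches, but it
also hides them, so a first-run figure is worth recording separately.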

Thanks!

Nathan Kurz
nate at verse.com

_______________________________________________
KinoSearch mailing list
KinoSearch at rectangular.com
http://www.rectangular.com/mailman/listinfo/kinosearch



