[KinoSearch] KinoSearch::Docs::Cookbook::ReusingSearchers

Henry henka at cityweb.co.za
Fri Sep 14 12:19:43 PDT 2007



> For what it's worth, that approach appeals to me as well.  The
> simplicity of having each node identical seems ideal so long as the
> resources on each machine are reasonably utilized.
>
> I think you mentioned somewhere earlier, but how large is your dataset,
> Henry?  Are you using MultiSearcher because your index is too large
> to fit on local disks or is it mid-size and you are trying to keep
> everything in RAM?

There are ~31M docs (growing hourly), with a total size of several
hundred gigabytes and growing.  So, in order to provide decent search
performance, the idea is to split the index across several machines in a
cluster.  If performance is limp, simply add more nodes and re-divide.  If
load becomes an issue (it will), apply the same brute-force formula.

Nice and simple.
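
A minimal sketch of the "add more nodes and re-divide" idea, assuming
documents are assigned to nodes by hashing a document id.  This is only an
illustration: node_for and divide are hypothetical names, not part of
KinoSearch, and the doc ids are made up.

```python
import zlib
from collections import defaultdict

def node_for(doc_id: str, num_nodes: int) -> int:
    """Pick a node for a document by hashing its id (stable across runs)."""
    return zlib.crc32(doc_id.encode("utf-8")) % num_nodes

def divide(doc_ids, num_nodes):
    """Group document ids by the node that would index them."""
    shards = defaultdict(list)
    for doc_id in doc_ids:
        shards[node_for(doc_id, num_nodes)].append(doc_id)
    return shards

# Hypothetical document ids standing in for the real collection.
docs = [f"doc-{i}" for i in range(10000)]

before = divide(docs, 4)   # four identical nodes
after = divide(docs, 5)    # one node added, index re-divided

# Count how many documents change nodes when the cluster grows.
moved = sum(1 for d in docs if node_for(d, 4) != node_for(d, 5))
```

With plain modulo assignment, many documents change nodes on each
re-divide, so in practice this means rebuilding the per-node indexes
rather than shuffling a few documents around.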

>> You had  also asked about the MultiSearcher sort.  I'd been back-
>> burnering that one because I was hoping that a new approach would
>> present itself during the course of fixing other things.  Well, I
>> believe that one has.
>>
>> What we need to do is break up nodes by task.
>
> Ouch ---  this runs counter to the simplicity I appreciate about the
> masterless system Henry proposed.    I agree that it would be pretty
> easy to do programmatically, but it doesn't sound much fun to
> administer.  I see this being of benefit only to really gigantic
> loads/indexes with the hardware customized to the role of each node.
> What are the cases you are thinking it would benefit?
>
>> If you'll recall, the problem with the MultiSearcher sort has to do
>> with the overhead of loading large fields into memory to cut down on
>> disk seeks.  This solves that problem by loading the whole lexicon
>> into one shared space for the whole search cluster.
>
> I think that coming up with a good way of returning the field value to
> the requester is going to be a better final solution.  The fear of
> disk seeks seems like a red herring --- if a block is being read
> often, it's going to be cached, if it's not often, it doesn't matter.
> And if for some reason we are thrashing the page cache and forcing a
> re-read, let's figure out how to change that!
>
> But perhaps I'm missing part of this equation?
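
The "return the field value to the requester" approach quoted above could
look roughly like the sketch below.  It assumes each node returns its own
hits already sorted ascending by the sort field, paired with the field
value, so the requester only has to merge.  merge_sorted_hits and the
sample hits are hypothetical and do not reflect the MultiSearcher API.

```python
import heapq
from itertools import islice

def merge_sorted_hits(per_node_hits, k):
    """Merge per-node hit lists (each sorted ascending by sort value)
    and keep only the first k hits overall."""
    return list(islice(heapq.merge(*per_node_hits), k))

# Each hit is (sort_field_value, doc_id); values here are invented dates.
node_a = [("2007-09-01", "a7"), ("2007-09-10", "a2")]
node_b = [("2007-09-03", "b1"), ("2007-09-14", "b9")]
node_c = [("2007-08-30", "c4")]

top3 = merge_sorted_hits([node_a, node_b, node_c], 3)
```

Because the field values travel with the hits, the requester never needs
the whole lexicon in memory; it only compares the values it was sent.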

My primary concern is customer-perceived search times.  Ideally, it should
be sub-second, no matter what.  Marvin's intimate with the details of why
this could be a problem.

I'd love to perform some search tests with a subset of our index - but
MultiSearcher doesn't have sorting yet (kinda critical in our project).
;-)

Regards
Henry





