[KinoSearch] KinoSearch::Docs::Cookbook::ReusingSearchers
Henry
henka at cityweb.co.za
Fri Sep 14 12:19:43 PDT 2007
> For what it's worth, that approach appeals to me as well. The
> simplicity of having each node identical seems ideal so long as the
> resources on each machine are reasonably utilized.
>
> I think you mentioned somewhere earlier, but how large is your dataset
> Henry? Are you using MultiSearcher because your index is too large
> to fit on local disks or is it mid-size and you are trying to keep
> everything in RAM?
There are ~31M docs (growing hourly), with a total size of several
hundred gigabytes and growing. So, in order to provide decent search
performance, the idea is to split the index across several machines in a
cluster. If performance is limp, simply add more nodes and re-divide. If
load becomes an issue (it will), apply the same brute force formula.
Nice and simple.
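
To make "add more nodes and re-divide" concrete, here is a rough sketch of the idea in Python. It is not KinoSearch code: node_for and divide are made-up names, and hashing the document id with CRC32 is only one assumed way of deciding which node gets which doc.

```python
import zlib

def node_for(doc_id, num_nodes):
    """Pick a node for a document (hypothetical helper).

    CRC32 is used because it gives the same answer on every machine
    and every run, unlike Python's built-in hash() for strings.
    """
    return zlib.crc32(doc_id.encode("utf-8")) % num_nodes

def divide(doc_ids, num_nodes):
    """Group document ids into one bucket per node."""
    buckets = [[] for _ in range(num_nodes)]
    for doc_id in doc_ids:
        buckets[node_for(doc_id, num_nodes)].append(doc_id)
    return buckets

docs = ["doc%d" % i for i in range(1000)]
four = divide(docs, 4)   # initial layout on four nodes
five = divide(docs, 5)   # add a node and re-divide
```

Note that a plain modulo like this moves most documents to a different node whenever the node count changes, so in this sketch "re-divide" really means rebuilding every node's index.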
>> You had also asked about the MultiSearcher sort. I'd been back-
>> burnering that one because I was hoping that a new approach would
>> present itself during the course of fixing other things. Well, I
>> believe that one has.
>>
>> What we need to do is break up nodes by task.
>
> Ouch --- this runs counter to the simplicity I appreciate about the
> masterless system Henry proposed. I agree that it would be pretty
> easy to do programmatically, but it doesn't sound much fun to
> administer. I see this being of benefit only to really gigantic
> loads/indexes with the hardware customized to the role of each node.
> What are the cases you are thinking it would benefit?
>
>> If you'll recall, the problem with the MultiSearcher sort has to do
>> with the overhead of loading large fields into memory to cut down on
>> disk seeks. This solves that problem by loading the whole lexicon
>> into one shared space for the whole search cluster.
>
> I think that coming up with a good way of returning the field value to
> the requester is going to be a better final solution. The fear of
> disk seeks seems like a red herring --- if a block is being read
> often, it's going to be cached, if it's not often, it doesn't matter.
> And if for some reason we are thrashing the page cache and forcing a
> re-read, let's figure out how to change that!
>
> But perhaps I'm missing part of this equation?
My primary concern is customer-perceived search times. Ideally, they should
be sub-second, no matter what. Marvin's intimate with the details of why
this could be a problem.
I'd love to perform some search tests with a subset of our index - but
MultiSearcher doesn't have sorting yet (kinda critical in our project).
;-)
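
For what it's worth, the "return the field value to the requester" idea from the quoted text could look roughly like the sketch below. It is only an illustration in Python, not the MultiSearcher API: merge_sorted_hits is a made-up name, and it assumes each node sends back hits as (sort_value, doc_id) pairs already sorted ascending on the sort field.

```python
import heapq
from itertools import islice

def merge_sorted_hits(per_node_hits, limit):
    """Merge per-node hit lists into one sorted list of at most `limit` hits.

    per_node_hits: one list per node of (sort_value, doc_id) pairs,
    each list already sorted ascending by sort_value (an assumption).
    """
    merged = heapq.merge(*per_node_hits, key=lambda hit: hit[0])
    return list(islice(merged, limit))

# Hypothetical results from two nodes, sorted on a date field.
node_a = [("2007-09-01", "a1"), ("2007-09-10", "a2")]
node_b = [("2007-09-05", "b1"), ("2007-09-14", "b2")]
top = merge_sorted_hits([node_a, node_b], 3)
```

Because every hit carries its own sort value, the requester in this sketch never needs the field's lexicon in memory; it only compares the values the nodes sent back.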
Regards
Henry