[KinoSearch] serializing safely

Marvin Humphrey marvin at rectangular.com
Thu Jun 14 19:42:28 PDT 2007


On Jun 14, 2007, at 5:42 PM, Nathan Kurz wrote:

> First, your architecture sounds reasonable to me:  if searches are
> never going to cross indexes, keeping them separate for each user
> seems like a reasonable idea.

I fully agree.

You want to avoid processing hits that you know can't match.  If you
know you will never have to multiplex search results across indexes,
definitely break them up.

Search costs are dominated by the time that it takes to process the  
matches for common terms.  If you're looking for 'orpheus', that's  
probably cheap; '+black +orpheus' will be more expensive in  
comparison, assuming that 'black' is a more common term in the  
corpus.  Even though the intersection of the set that matches 'black'  
and the set that matches 'orpheus' is small, you still have to  
iterate over *all* the matches for both terms.
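
To make the cost concrete, here is a minimal Python sketch of a
two-term intersection over sorted lists of document numbers.  It is a
simplified illustration and not KinoSearch's actual scoring code; the
function name, the step counter, and the sample postings are all
made up for the example.

```python
def intersect(postings_a, postings_b):
    """Intersect two sorted lists of doc numbers.

    Returns (matches, steps), where steps counts loop iterations so the
    amount of work can be compared against the size of the result.
    """
    i = j = steps = 0
    matches = []
    while i < len(postings_a) and j < len(postings_b):
        steps += 1
        a, b = postings_a[i], postings_b[j]
        if a == b:
            # Both terms occur in this document.
            matches.append(a)
            i += 1
            j += 1
        elif a < b:
            i += 1  # advance the list that is behind
        else:
            j += 1
    return matches, steps


# Hypothetical corpus: 'black' occurs in every even-numbered document,
# 'orpheus' in only three documents.
black = list(range(0, 10000, 2))
orpheus = [10, 5000, 9990]

matches, steps = intersect(black, orpheus)
print(len(matches), "matches found in", steps, "steps")
```

The result holds only three documents, but nearly all of the work goes
into stepping through the postings for the common term.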

OTOH, if you know you will have to multiplex results from time to
time, searching several indexes is more expensive than searching one,
particularly in terms of disk I/O.  In a single index, all the
information about any given term will be relatively concentrated.
With multiple indexes, the information is more scattered, so the disk
has to seek a lot more.

>   the easiest solution may be to partition the search off
> to separate machines, each handling only a subset of your users.
> Rather than thinking about caching  Searcher objects within the
> FastCGI, you could prepare for this eventuality by running your search
> in an external server process, either on the same machine or another.
> This process could then cache Searchers for the indexes of the most
> recent users and use the appropriate one for the search.

This is a good plan.
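
A rough sketch of the caching part, in Python for brevity: a small
least-recently-used cache keyed by user.  The class name, the
`open_searcher` callback, and the capacity are assumptions made for
illustration; in a real server the callback would open a Searcher on
that user's index, and nothing here is KinoSearch API.

```python
from collections import OrderedDict


class SearcherCache:
    """Keep searchers for the most recently active users (LRU eviction)."""

    def __init__(self, open_searcher, capacity=3):
        # open_searcher: hypothetical callable taking a user id and
        # returning a searcher object for that user's index.
        self._open = open_searcher
        self._capacity = capacity
        self._cache = OrderedDict()

    def get(self, user_id):
        if user_id in self._cache:
            # Cache hit: mark this user as most recently used.
            self._cache.move_to_end(user_id)
            return self._cache[user_id]
        # Cache miss: open a searcher, then evict the oldest if over capacity.
        searcher = self._open(user_id)
        self._cache[user_id] = searcher
        if len(self._cache) > self._capacity:
            self._cache.popitem(last=False)
        return searcher
```

The server process would call `get(user_id)` on each request and run
the query against whatever comes back, so repeat searches by a recent
user skip the cost of reopening the index.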

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/
