[KinoSearch] serializing safely

Nathan Kurz nate at verse.com
Thu Jun 14 17:42:28 PDT 2007


On 6/14/07, Hans Dieter Pearcey <hdp at pobox.com> wrote:
> On Thu, Jun 14, 2007 at 07:09:12AM -0700, Marvin Humphrey wrote:
> > >My app primarily differs in that I was planning on having many
> > >invindexes, two or three per user, so opening them all at program
> > >start would probably be inefficient (there are several hundred of
> > >them).
> >
> > OK.  With that architecture, you'll need to factor in the time it
> > takes to begin reading from any one of those invindexes.
>
> It may be a stupid architecture; I'm not really very experienced with
> invindexes.  I want to index about 250G of email, which seems like a lot to me,
> so I'm assuming that partitions will be useful (since each user only searches
> their own email).  Am I prematurely optimizing?

Hi Hans ---

I've been thinking about some similar architectural issues, and while
I don't have any experience with a corpus as large as the one you are
dealing with, I thought I'd jump in.

First, your architecture sounds reasonable to me:  if searches are
never going to cross indexes, keeping a separate index for each user
seems like a good idea.  Yes, initializing each Searcher object will
be expensive, but I think the smaller size of each index is going to
offset this.  Starting with this architecture strikes me as good
forethought, and not premature.

Worrying about caching hot Searcher objects for those indexes does
strike me as premature, or possibly misguided.  The thing that takes
the most time (I'm guessing) is reading the index from the disk, so
caching the object to disk isn't going to help you a lot.  To get a
real advantage, you are going to need it hanging around in RAM, and
given the size of your corpus this is going to require finesse.

Presuming you are running Linux, most spare RAM on the system will be
used to cache recently read files so that they can be read from
relatively fast memory rather than from the much slower disk.  The
more big objects you cache yourself, the less space is available for
the system to cache files.  It's a trade-off:  if you know you are
going to reuse the object, it's a win, but if you don't, you are
probably better off letting the system do its thing.  I'd wait and
measure.
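
As a rough way to see the file cache at work before deciding anything,
you could time two back-to-back reads of one index file.  This is only
a sketch in Python (not KinoSearch code); time_read and cold_vs_warm
are names I made up, and the first read is only "cold" if the file is
not already sitting in the system's cache.

```python
import time

def time_read(path, chunk_size=1 << 20):
    """Return the seconds taken to read the whole file at `path` once."""
    start = time.perf_counter()
    with open(path, "rb") as f:
        # Read in 1 MB chunks and discard the data; we only want timing.
        while f.read(chunk_size):
            pass
    return time.perf_counter() - start

def cold_vs_warm(path):
    """Read the file twice and return (first_seconds, second_seconds).

    The second read is usually served from the OS file cache, so a large
    gap between the two numbers suggests disk IO dominates.
    """
    first = time_read(path)
    second = time_read(path)
    return first, second
```

If the second number is much smaller than the first, the system cache
is already doing most of what a hand-rolled cache would do.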

If disk IO does turn out to be a bottleneck (and it will with heavy
enough usage) the easiest solution may be to partition the search off
to separate machines, each handling only a subset of your users.
Rather than thinking about caching Searcher objects within the
FastCGI process, you could prepare for this eventuality by running
your search in an external server process, either on the same machine
or another.
This process could then cache Searchers for the indexes of the most
recent users and use the appropriate one for the search.
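
To make the "cache Searchers for the most recent users" idea concrete,
here is a minimal sketch of a least-recently-used cache in Python.  It
is not KinoSearch API:  SearcherCache and the open_searcher callback
are hypothetical, and open_searcher stands in for whatever actually
opens the invindex for a given user.

```python
from collections import OrderedDict

class SearcherCache:
    """Keep at most `max_size` open searchers, evicting the least recently used."""

    def __init__(self, open_searcher, max_size=8):
        self._open = open_searcher      # hypothetical: user_id -> searcher
        self._max = max_size
        self._cache = OrderedDict()     # oldest entry first, newest last

    def get(self, user_id):
        if user_id in self._cache:
            # Cache hit: mark this user as the most recently used.
            self._cache.move_to_end(user_id)
            return self._cache[user_id]
        # Cache miss: pay the initialization cost once, then remember it.
        searcher = self._open(user_id)
        self._cache[user_id] = searcher
        if len(self._cache) > self._max:
            # Drop the entry that has gone unused the longest.
            self._cache.popitem(last=False)
        return searcher
```

The max_size knob is where the RAM trade-off above shows up:  every
cached Searcher is memory the system can't use for caching files.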

Alternatively, you could cache a small number of Searcher objects in
each FastCGI process, and then come up with a way of preferentially
directing users to the same process they used on the previous request.
Historically, there have been some affinity patches for mod_fastcgi
that did this, but I don't know if they have been updated.  But in
general, I don't think there is going to be any good way for multiple
processes or threads to share a single Searcher object.

I'd start by sticking with the separate indexes, skipping the caching,
and seeing how it goes.
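
If you do come back to the per-process approach later, the affinity
part could be as simple as hashing the user to a worker number.  This
is a sketch in Python, not how the mod_fastcgi patches worked (I
haven't checked);  worker_for_user is a made-up name, and it assumes
a fixed number of worker processes.

```python
import zlib

def worker_for_user(user_id, num_workers):
    """Map a user to a worker index in the range 0 .. num_workers - 1.

    CRC32 is used because it gives the same value in every process,
    unlike Python's built-in hash() for strings, which is randomized
    per process by default.
    """
    return zlib.crc32(user_id.encode("utf-8")) % num_workers
```

The same user always lands on the same worker, so any Searcher that
worker has cached for them gets reused.  Changing the number of
workers reshuffles most users, which is fine for a cache.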


Hope this helps,

Nathan Kurz
nate at verse.com


