[KinoSearch] fast phrase matching [patch]
Marvin Humphrey
marvin at rectangular.com
Sun Sep 30 20:39:10 PDT 2007
On Sun, Sep 30, 2007 at 03:11:15PM -0600, Nathan Kurz wrote:
> I think the major problem is going to be how to fake the mmap() for Windows
> systems where it does not exist.
Do a web search for 'windows mmap' -- it doesn't look like it's too hard to
fake up. This is the kind of thing Charmonizer is for.
> A compromise may be possible where we keep the
> stream classes, but change them to return raw data in
> page-size-multiple chunks, with Windows double-buffering and the Linux
> implementation doing nothing other than returning a pointer into a mapped
> region.
The priority is to get InStream to use mmap. OutStream is less important, and
I'm not even thinking about it right now.
FWIW, other than going back and creating the compound file for each segment,
there is never any re-reading during index creation. (Hmm. I don't think there
are even any seeks, so it might be possible to eliminate OutStream_SSeek.)
Also, all files are written once and never revised.
> > RAMFolder is mostly for testing, but I'd
> > be crying in my beer if all the KS tests had to use disk i/o.
>
> The goal is to get everything running as fast as (or faster than)
> RAMFolder works now. There is not going to be any physical disk i/o
> happening during testing other than that needed to get the data cached
> by the system page buffer the first time it is read.
The majority of tests don't hit disk *at all*. It's going to be hard to beat
that. :)
> > Another thing about mmap: how well does it work on 32-bit systems when dealing
> > with large files (which are common with KS)?
>
> I don't think this is going to be a problem, although I haven't
> thought it through in detail. The underlying implementation of
> system read() is essentially mmap(), so we shouldn't hit any
> fundamental problems. The total amount mapped at one time can't be
> larger than the address space (< 4GB for 32-bit Linux), but I think we
> can solve this by mapping and unmapping as necessary.
What I was thinking we'd try is just substituting an mmap/munmap pair for each
buffer refill we currently perform. Sounds like you and I are on the same
page. ;)
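
A rough sketch of that idea, for POSIX systems only: each "refill" unmaps the old window and maps the page-aligned window containing the requested position. MapReader and its functions are hypothetical names made up for this example; they are not KinoSearch's InStream API. On a 32-bit build, large files would additionally need a 64-bit off_t (e.g. compiling with -D_FILE_OFFSET_BITS=64).

```c
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical reader state, for illustration only. */
typedef struct {
    int    fd;
    off_t  file_len;
    off_t  win_start; /* file offset of the window; multiple of win_max */
    size_t win_len;   /* bytes currently mapped, 0 if none */
    char  *win;       /* current mapped window, or NULL */
    size_t win_max;   /* window size; a multiple of the page size */
} MapReader;

int MapReader_open(MapReader *self, const char *path, size_t pages_per_window)
{
    struct stat st;
    long page = sysconf(_SC_PAGESIZE);

    if (page <= 0 || pages_per_window == 0) return -1;
    self->fd = open(path, O_RDONLY);
    if (self->fd < 0) return -1;
    if (fstat(self->fd, &st) != 0) { close(self->fd); return -1; }
    self->file_len  = st.st_size;
    self->win_start = 0;
    self->win_len   = 0;
    self->win       = NULL;
    self->win_max   = (size_t)page * pages_per_window;
    return 0;
}

/* The "buffer refill": one munmap of the old window, one mmap of the new. */
static int MapReader_refill(MapReader *self, off_t pos)
{
    off_t  start     = pos - (pos % (off_t)self->win_max);
    off_t  remaining = self->file_len - start;
    size_t len       = remaining < (off_t)self->win_max
                     ? (size_t)remaining : self->win_max;
    void  *p;

    assert(len > 0); /* the caller guarantees pos < file_len */
    if (self->win != NULL) {
        munmap(self->win, self->win_len);
        self->win     = NULL;
        self->win_len = 0;
    }
    p = mmap(NULL, len, PROT_READ, MAP_PRIVATE, self->fd, start);
    if (p == MAP_FAILED) return -1;
    self->win       = p;
    self->win_start = start;
    self->win_len   = len;
    return 0;
}

/* Return the byte at file offset pos, or -1 on EOF or error. */
int MapReader_byte_at(MapReader *self, off_t pos)
{
    if (pos < 0 || pos >= self->file_len) return -1;
    if (self->win == NULL
        || pos < self->win_start
        || pos >= self->win_start + (off_t)self->win_len) {
        if (MapReader_refill(self, pos) != 0) return -1;
    }
    return (unsigned char)self->win[pos - self->win_start];
}

void MapReader_close(MapReader *self)
{
    if (self->win != NULL) munmap(self->win, self->win_len);
    self->win     = NULL;
    self->win_len = 0;
    close(self->fd);
}
```

Because only one window is mapped at a time, the address space in use stays bounded by win_max no matter how large the file is.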
Marvin Humphrey
Rectangular Research
http://www.rectangular.com/