[KinoSearch] fast phrase matching [patch]

Marvin Humphrey marvin at rectangular.com
Fri Sep 28 18:40:02 PDT 2007



On Fri, Sep 28, 2007 at 01:16:11PM -0600, Nathan Kurz wrote:
> Yes, I was being unclear.  The varint routines would still exist
> outside as library routines. I think it might make more sense for the
> Posting class to get a pointer to raw data (representing one
> compressed posting) rather than stream to read from .  This would both
> allow for greater flexibility of data store (mmap, SQL database) and
> greater efficiency (data copied directly to Scorer without
> intermediary).

OK, this is worth its own thread.  Since we have one going called "The Life
Cycle of a Posting", I'll use that.

> > > And does SQLite use the same format? The code there is generally pretty and
> > > likely well tuned.
> >
> > Dunno about that.  I've never spelunked the SQLite code base
> 
> In my opinion, the core SQLite code is some of the best open source
> C-code out there. Also, squinted at from the right point of view it
> solves a very similar problem.  There are now full-text-search
> extensions for it (http://www.sqlite.org/cvstrac/wiki?p=FtsTwo),
> although it doesn't currently support Unicode or complex scoring.

Haha, when I pigeonholed Dr. Hipp a year ago at OSCON, he mentioned that he
had studied the Lucene file format.  At a glance, it looks like he's using the
same basic segmentation scheme used by Lucene and KS.

> It's variable integer code is here:
> http://www.sqlite.org/cvstrac/fileview?f=sqlite/src/util.c

Looks like he's using the BER format.  There are a couple neat things
about his implementation.  We'll deal with it in a specialized thread.

> >The setvbuf thing is really weird, but I couldn't deny
> > the benchmarks.  Maybe an 'objdump -S FSFileDes.o' will tell me something.
> 
> I'm not familiar with setvbuf.  I just read the man pages, and my
> guess is that it is orthogonal to the system level caching.  But I'm
> uncertain.

Well, here's the deal: fread(), fwrite() and friends are buffered i/o.  But KS
InStream and OutStream objects maintain their own buffers.  (These buffers are
1k each, but I've tried other sizes.)  It should be possible to turn off the
buffering via setvbuf and still get all the classic benefits of buffered i/o.
But if I do that, KS starts suckin' pondwater (to quote my high school soccer
coach).

It's been a while now, but I'm almost certain I tried this on three different
systems -- my G4 laptop, a Pentium 4 running FreeBSD 5.3, and a dual Xeon
running Red Hat 9 -- and KS sucked pondwater on all of 'em.

I'm tempted to try re-implementing InStream to use mmap exclusively.  I think
that the i/o usage patterns of KS might make it a suitable candidate for that
treatment.  But then your idea of allowing internal classes to bypass the
stream classes altogether is interesting.

One of the downsides of bypassing would be that any class that did that
wouldn't be able to use RAMFolder.  RAMFolder's "files" are implemented as a
ragged array of ByteBuf objects, so you don't get a pointer to a contiguous
chunk of memory you can pass along.  RAMFolder is mostly for testing, but I'd
be crying in my beer if all the KS tests had to use disk i/o.

Another thing about mmap: how well does it work on 32-bit systems when dealing
with large files (which are common with KS)?

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/


_______________________________________________
KinoSearch mailing list
KinoSearch at rectangular.com
http://www.rectangular.com/mailman/listinfo/kinosearch




More information about the kinosearch mailing list