[KinoSearch] fast phrase matching [patch]

Nathan Kurz nate at verse.com
Fri Sep 28 12:16:11 PDT 2007



On 9/28/07, Marvin Humphrey <marvin at rectangular.com> wrote:
> On Thu, Sep 27, 2007 at 02:06:16PM -0600, Nathan Kurz wrote:
> >I'd wonder whether keeping this [varint routines] more
> > encapsulated in the Posting class might be more flexible.  We'll
> > still probably need the same routines, though.
>
> Other things besides Postings need these routines: Lexicons, term vectors,
> document storage, etc.

Yes, I was being unclear.  The varint routines would still exist
outside as library routines.  I think it might make more sense for the
Posting class to get a pointer to raw data (representing one
compressed posting) rather than a stream to read from.  This would
allow both greater flexibility of data store (mmap, SQL database) and
greater efficiency (data copied directly to the Scorer without an
intermediary).

This last part is dependent on an earlier exchange we had:
http://www.gossamer-threads.com/lists/kinosearch/discuss/1099
In this scheme (which I still like) Posting_read does the
decompression directly from data-on-disk to Scorer-struct.
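
For illustration, here is a minimal sketch of the pointer-based
approach described above.  It is not KinoSearch code: the names
DemoPosting, demo_decode_vint and demo_posting_read are hypothetical,
and the encoding is assumed to be a Lucene-style varint (7 data bits
per byte, low-order group first, high bit set when more bytes follow).
The reader takes a raw pointer to one compressed posting, decodes
straight into the caller's struct, and returns a pointer to the next
posting, so the same code would work over an mmap'd region or a blob
fetched from a database.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the fields a Scorer wants from one posting. */
typedef struct {
    uint32_t doc_delta;  /* gap from the previous document number */
    uint32_t freq;       /* term frequency within the document */
} DemoPosting;

/* Decode one varint starting at buf, store it in *out, and return the
 * number of bytes consumed.  Assumed format: 7 data bits per byte,
 * low-order group first, high bit set on every byte except the last.
 * At most 5 bytes are read for a 32-bit value. */
static size_t
demo_decode_vint(const uint8_t *buf, uint32_t *out)
{
    uint32_t value = 0;
    size_t   used  = 0;
    int      shift = 0;
    uint8_t  byte;

    do {
        byte   = buf[used++];
        value |= (uint32_t)(byte & 0x7F) << shift;
        shift += 7;
    } while ((byte & 0x80) && shift < 35);

    *out = value;
    return used;
}

/* Decompress one posting directly from raw bytes into the caller's
 * struct, with no intermediate stream or copy.  Returns a pointer to
 * the first byte after the posting. */
static const uint8_t *
demo_posting_read(const uint8_t *raw, DemoPosting *target)
{
    assert(raw != NULL && target != NULL);
    raw += demo_decode_vint(raw, &target->doc_delta);
    raw += demo_decode_vint(raw, &target->freq);
    return raw;
}
```

Because the decoder returns a byte count as a size_t, the caller can
add it to the pointer directly, and a loop over a buffer of postings
is just repeated calls to demo_posting_read until the returned pointer
reaches the end of the buffer.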

> > And does SQLite use the same format? The code there is generally pretty and
> > likely well tuned.
>
> Dunno about that.  I've never spelunked the SQLite code base

In my opinion, the core SQLite code is some of the best open source
C code out there.  Also, squinted at from the right point of view, it
solves a very similar problem.  There are now full-text-search
extensions for it (http://www.sqlite.org/cvstrac/wiki?p=FtsTwo),
although they don't currently support Unicode or complex scoring.

SQLite also has a license that I quite admire:
  The author disclaims copyright to this source code.  In place of a
  legal notice, here is a blessing:
  **    May you do good and not evil.
  **    May you find forgiveness for yourself and forgive others.
  **    May you share freely, never taking more than you give.

Its variable-length integer code is here:
http://www.sqlite.org/cvstrac/fileview?f=sqlite/src/util.c

> The setvbuf thing is really weird, but I couldn't deny
> the benchmarks.  Maybe an 'objdump -S FSFileDes.o' will tell me something.

I'm not familiar with setvbuf.  I just read the man pages, and my
guess is that it is orthogonal to the system level caching.  But I'm
uncertain.
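
For reference, a minimal sketch of how setvbuf is typically used.
The helper name demo_open_buffered is hypothetical and this is not the
FSFileDes code under discussion.  setvbuf controls the user-space
buffer that stdio keeps for a FILE stream; the operating system's own
file cache sits underneath that and is not configured by this call.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper: open a file for reading with a fully buffered
 * stdio stream whose buffer is buf_size bytes.  Passing NULL as the
 * buffer asks the C library to allocate it.  Returns NULL on failure. */
static FILE *
demo_open_buffered(const char *path, size_t buf_size)
{
    FILE *fh;

    assert(path != NULL);
    fh = fopen(path, "rb");
    if (fh == NULL) {
        return NULL;
    }
    /* setvbuf has to be called before any other operation on the
     * stream; it returns nonzero if the request cannot be honored. */
    if (setvbuf(fh, NULL, _IOFBF, buf_size) != 0) {
        fclose(fh);
        return NULL;
    }
    return fh;
}
```

Changing buf_size changes how many bytes each underlying read request
asks for, which is one plausible way it could show up in benchmarks.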

> No worries, mate.  Can you show me how you normally like your DEBUG and ASSERT
> macros set up?

I will clean things up and try to send a version late tonight.

> > OK.  I went with this because we add the return value to a pointer,
> > and I thought that on a 64-bit system this might save a back and forth
> > conversion.
>
> OK, that's a good enough reason.  I'll change it back.

I didn't test this, though. It's possible the compiler already
optimizes away any conversions, since it is an inline.

> > Hmm, I downloaded the patch I attached, and didn't find any tabs in
> > it. Either I've done something wrong twice, or maybe something else is
> > adding them.

It was me messing up twice.  I tried again, and found the tabs just as
you said.  Problem fixed for the future (I think).

Nathan Kurz
nate at verse.com

_______________________________________________
KinoSearch mailing list
KinoSearch at rectangular.com
http://www.rectangular.com/mailman/listinfo/kinosearch



