[KinoSearch] Wildcards (was Re: KinoSearch feature suggestions)

Nathan Kurz nate at verse.com
Wed Jan 23 23:47:25 PST 2008



On 1/23/08, Marvin Humphrey <marvin at rectangular.com> wrote:
> If we punt on scoring, it might make sense from an i/o standpoint to
> iterate through all the matches up front and save a BitVector with
> matching doc nums set.

Hi Marvin!  I'm only half paying attention from here in the peanut
gallery, but this got my attention.  Don't punt on the scoring!

From my naive point of view, a wildcard just looks like another way of
specifying a boolean OR.  Why not expand it out at the parser level?
Sure, it might be really big, but there's nothing wrong with providing
support for industrial strength boolean queries.  Of course, I say
that because I'm going to want them one day for my own nefarious
purposes, and with flexible scoring at that.
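
To make the expansion idea concrete, here is a rough sketch in Python.
It is not KinoSearch code: the function names, the in-memory lexicon,
and the tuple-based query nodes are all made up for illustration, and a
real parser would walk the term dictionary rather than a list.

```python
import fnmatch

def expand_wildcard(pattern, lexicon):
    """Return the sorted terms in `lexicon` matching a shell-style wildcard."""
    return sorted(t for t in lexicon if fnmatch.fnmatchcase(t, pattern))

def to_or_query(pattern, lexicon):
    """Rewrite a wildcard as a hypothetical ('OR', [('TERM', t), ...]) node."""
    return ("OR", [("TERM", t) for t in expand_wildcard(pattern, lexicon)])

# Hypothetical lexicon; a real index would supply its own term list.
lexicon = ["peter", "petroleum", "petunia", "piano"]
query = to_or_query("pet*", lexicon)
```

Once the wildcard is just an OR over ordinary terms, each term can be
scored by whatever scorer the rest of the query uses.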

> Actually, if we iterate up front, we could find out the IDF of the
> fragment and then use that to assess a crude score.

I will be so appreciative some day if you move away from architectures
that presume IDF is always going to be the way that things are
scored.

> The problem we have now is that the priority queue of PostingLists
> probably isn't a good way to zip through a lot of matching terms.
> There's going to be some disk seeking, as the results for "peter" and
> "petroleum" and "petunia" are interleaved.  Hmm...

All these disk seeks you are seeing...  have you ever caught one in
the wild?  Modern systems have several levels of caching between you
and the disk head.  While truly random lookup is bad, anything
remotely predictable is probably going to be cached.  Don't avoid the
interleaved data until you're sure it is going to be a problem.
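
For reference, the interleaved read pattern in question looks roughly
like the sketch below.  It is a toy in Python, assuming each posting
list is an already-sorted list of doc nums held in memory; the names
and the sample postings are invented, and nothing here reflects how
KinoSearch actually stores or reads postings.

```python
import heapq

def union_doc_nums(posting_lists):
    """Merge sorted posting lists into one sorted list of unique doc nums."""
    result = []
    # heapq.merge pulls from whichever list has the smallest next doc num,
    # so reads hop between the lists in an interleaved order.
    for doc in heapq.merge(*posting_lists):
        if not result or result[-1] != doc:
            result.append(doc)
    return result

# Invented postings for three terms matching "pet*".
postings = {
    "peter": [1, 4, 9],
    "petroleum": [2, 4, 7],
    "petunia": [3, 9],
}
docs = union_doc_nums(postings.values())
```

Each individual list is still read front to back, which is the kind of
predictable access that read-ahead and caching tend to handle well.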

Nathan Kurz
nate at verse.com

_______________________________________________
KinoSearch mailing list
KinoSearch at rectangular.com
http://www.rectangular.com/mailman/listinfo/kinosearch



