[KinoSearch] OpenQueryParser (was "opening up the scorers")

Nathan Kurz nate at verse.com
Sun Apr 27 15:07:25 PDT 2008



On Wed, Apr 23, 2008 at 10:21 PM, Marvin Humphrey
<marvin at rectangular.com> wrote:
>  The problem faced by any of these single-field parsers, though, is that
> things get messy when you try to combine queries that involve multiple
> fields, which is a very common practical need.
> ...
>  I don't see a way to fix that problem except at a low-level via a
> multi-field parser.  Do you?

You could have the Parser build a tree with a special field type of
'any', which then gets expanded out to multiple fields at a later
stage.  I'd sort of like to have this stage anyway, since it keeps the
Parser more independent of the Index, and would let me do tricks like
replacing OrScorer with MyOrScorer.  Instead of trying to build an
optimizing Parser, you could do the optimizations and checks in a
separate pass and keep the Parser simpler.
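
As a rough illustration of that separate expansion pass, here is a
minimal Python sketch.  The node classes (Term, And, Or), the ANY
constant and expand_any() are all invented for this example; none of
them are KinoSearch APIs.  The parser only ever emits the 'any'
placeholder, and the field list is supplied later by whoever knows
the Index.

```python
from dataclasses import dataclass
from typing import List, Union

ANY = "any"  # hypothetical placeholder field emitted by the parser


@dataclass
class Term:
    field: str
    text: str


@dataclass
class Or:
    children: List["Node"]


@dataclass
class And:
    children: List["Node"]


Node = Union[Term, Or, And]


def expand_any(node, fields):
    """Return a new tree in which every 'any' term becomes an Or
    over the given fields.  The input tree is left untouched."""
    if isinstance(node, Term):
        if node.field == ANY:
            return Or([Term(f, node.text) for f in fields])
        return node
    # And / Or: rebuild the same node type with expanded children.
    return type(node)([expand_any(c, fields) for c in node.children])


# The parser output knows nothing about the index's fields...
tree = And([Term(ANY, "foo"), Term("title", "bar")])
# ...and the later pass fills them in.
expanded = expand_any(tree, ["title", "body"])
```

A pass shaped like this would also be a natural place to do other
rewrites and checks on the tree, though that part is not shown.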

> > A stray thought:   QueryParser implies that it is parsing a Query,
> > whereas it's probably clearer to think of it as building a query from
> > some text, with the output tree being the actual Query.  I don't
> > suppose that QueryBuilder strikes you as a clearer name?  It would
> > make it clearer what it does...
> >
>
>  It's arguable.  QueryParser does parse a query string, after all.

I think that's part of the problem.  In my mind, a Query is just a
string, not a Tree.  Having a QueryParser that parses a Query (string)
and returns a ParseTree would be great.  Having it parse a query and
return a Query is confusing.

>  The goal is to behave as an end user typing into a search box on a website
> would expect.  The big web search engine sites set the trends, and
> KinoSearch's core QueryParser follows.

Do users really expect this behaviour, or is it a shortcut taken by
programmers?  Realizing that probably only a tiny number of end users
ever use stop words at all, if _I_ were to type '-foo' into a site
search box, I would expect it to return all documents that do not
contain the word 'foo', probably ordered by popularity.  This would
certainly be more useful than claiming that no documents match.  That
said, despite urging you to make KinoSearch more general, I agree that
out of the box it should work the way that users expect of a site
search engine, and that any other uses should be secondary.

> > My main preference would be to have the Scorer
> > capable of ordering and returning large numbers of results without
> > blowing up --- whether it does so by default is merely a detail.
> >
>
>  KS won't blow up, because the standard TopDocs search uses a finite-sized
> HitQueue to order results on the fly as scoring proceeds rather than
> accumulating a giant array of hits and sorting by score at the end.

'Blow up' was sloppy speech on my part.  'Grind to a halt' would
probably be closer.  I'd like to have a Scorer that is either smart
enough to avoid processing the entire index for queries that match
(almost) every document, or fast enough that processing the entire
index is no big deal.

I haven't thought about it for a while, but at one point I had a
scheme to do this with a minimum document number and a maximum
document score.  If the whole HitQueue was at the max score, you could
return early.  If a max score occurs at less than the minimum document
number, you skip it as already returned.  This would let you
semi-efficiently do things like return hits 1,000,000 to 1,000,100,
although sometimes you'd need a second pass to pick up stragglers.
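
A minimal Python sketch of one possible reading of that scheme
follows.  It assumes hits arrive in ascending document number order,
that results rank by score descending and then by document number
ascending, and that the caller passes in the last (score, doc) pair
already returned as the cursor.  The names next_page() and all_docs()
are made up for this example, and a plain heapq stands in for the
HitQueue.

```python
import heapq


def next_page(hits, page_size, max_score=float("inf"), min_doc=-1):
    """hits: iterable of (doc, score) in ascending doc order.
    (max_score, min_doc) is the cursor: the last hit already returned,
    or a known score ceiling with min_doc=-1 for the first page."""
    heap = []  # min-heap of (score, -doc); the root is the weakest hit kept
    for doc, score in hits:
        # Anything ranked at or ahead of the cursor was already returned.
        if score > max_score or (score == max_score and doc <= min_doc):
            continue
        key = (score, -doc)
        if len(heap) < page_size:
            heapq.heappush(heap, key)
        elif key > heap[0]:
            heapq.heapreplace(heap, key)
        # If the weakest kept hit is at the ceiling, the whole queue is.
        # Later docs can at best tie with a higher doc number, so stop.
        if len(heap) == page_size and heap[0][0] == max_score:
            break
    return [(-neg_doc, score) for score, neg_doc in sorted(heap, reverse=True)]


def all_docs(n, seen):
    """A query matching every document with the same score, such as a
    bare '-foo'.  Records which docs the collector actually pulled."""
    for doc in range(n):
        seen.append(doc)
        yield doc, 1.0


seen = []
page1 = next_page(all_docs(1000, seen), 3, max_score=1.0)
# The cursor for the following page is the last hit of this one.
last_doc, last_score = page1[-1]
page2 = next_page(all_docs(1000, []), 3, max_score=last_score, min_doc=last_doc)
```

In this toy case the first page stops after pulling three documents
instead of all 1000.  The early exit only helps when many hits sit at
the ceiling score; the second pass for stragglers is not attempted
here.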

Nathan Kurz
nate at verse.com

_______________________________________________
KinoSearch mailing list
KinoSearch at rectangular.com
http://www.rectangular.com/mailman/listinfo/kinosearch



