[KinoSearch] OpenQueryParser (was "opening up the scorers")
Marvin Humphrey
marvin at rectangular.com
Wed Apr 23 21:21:03 PDT 2008
On Apr 22, 2008, at 2:39 PM, Nathan Kurz wrote:
> On Mon, Apr 21, 2008 at 10:54 PM, Marvin Humphrey
> <marvin at rectangular.com> wrote:
>> Instead of opening up the core class, I'd be more inclined to write and
>> release KSx::Search::OpenQueryParser, which would look a lot like the
>> current QueryParser but single-field and with factory methods.
>
> I'm pretty happy with that approach, although I think there might be a
> little more safe maneuvering room before the slope gets slippery.
> Opening up the API to allow syntax changes seems excessive;
I think another nice feature for OpenQueryParser would be to base it
on Parse::YAPP or something like that, and have the grammar be
configurable via a constructor param. The core QueryParser is based
on regexes, which is fast and dependency-free but not extensible.
A grammar-based QueryParser would offer more opportunities for
customization.
The problem faced by any of these single-field parsers, though, is
that things get messy when you try to combine queries that involve
multiple fields, which is a very common practical need. Say you're
searching for "foo AND NOT bar", you parse that twice for the "title"
and "content" fields, then join the two parsed queries with an
ORQuery. You end up with something like this:
    title:(foo AND NOT bar) OR content:(foo AND NOT bar)
Unfortunately, ORing the two result sets together means that any
document where the title matches 'foo AND NOT bar' will match
regardless of whether the content field contains 'bar' -- and that's
probably not what the end user wants.
This was a bug in KS that got fixed a while back, and it took making
QueryParser multi-field to fix it. What QueryParser does now is
essentially this:
    (title:foo OR content:foo) AND NOT (title:bar OR content:bar)
I don't see a way to fix that problem except at a low level via a
multi-field parser. Do you?
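To make the difference concrete, here is a toy illustration in Python rather than KinoSearch code. The three-document corpus and the helper names (has, per_field_then_or, multi_field) are made up for this example; it only evaluates the two boolean expressions above against documents held in memory.

```python
# Toy corpus (hypothetical data): doc id -> field name -> set of terms.
docs = {
    1: {"title": {"foo"}, "content": {"foo", "bar"}},
    2: {"title": {"foo"}, "content": {"foo"}},
    3: {"title": {"bar"}, "content": {"baz"}},
}
FIELDS = ("title", "content")


def has(doc, field, term):
    """True if the given field of the document contains the term."""
    return term in doc[field]


def per_field_then_or(doc):
    # title:(foo AND NOT bar) OR content:(foo AND NOT bar)
    return any(has(doc, f, "foo") and not has(doc, f, "bar") for f in FIELDS)


def multi_field(doc):
    # (title:foo OR content:foo) AND NOT (title:bar OR content:bar)
    any_foo = any(has(doc, f, "foo") for f in FIELDS)
    any_bar = any(has(doc, f, "bar") for f in FIELDS)
    return any_foo and not any_bar


naive_hits = sorted(i for i, d in docs.items() if per_field_then_or(d))
fixed_hits = sorted(i for i, d in docs.items() if multi_field(d))
print(naive_hits)
print(fixed_hits)
```

Document 1 has 'bar' in its content but not in its title, so the per-field version lets it through on the strength of the title clause alone, while the multi-field version excludes it.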
> I ignored
> QueryParser and wrote my own, but my fear is that the burden of
> writing a parser is going to stop anyone from casually experimenting
> with different scorers.
I can see that. There are some custom Query subclass concepts that
don't require mods to QueryParser to be practical, but there are lots
that do.
> A stray thought: QueryParser implies that it is parsing a Query,
> whereas it's probably clearer to think of it as building a query from
> some text, with the output tree being the actual Query. I don't
> suppose that QueryBuilder strikes you as a clearer name? It would
> make it clearer what it does...
It's arguable. QueryParser does parse a query string, after all.
>>> I strongly think you want to 'return the universe' [for a bare NOT
>>> query].
>>>
>>
>> Returning the universe is a perfectly reasonable behavior for some
>> applications. However, I strongly disagree that it should be the default
>> behavior for the core QueryParser.
>>
>> If I write NOTQuery, at least then it becomes possible to implement your
>> desired behavior. It's probably best if I focus my energies on
>> that task.
>
> Probably an agree to disagree sort of situation.
The goal is to behave as an end user typing into a search box on a
website would expect. The big web search engine sites set the trends,
and KinoSearch's core QueryParser follows.
However, I think it would be reasonable for OpenQueryParser to have
return-the-universe behavior by default. How easy would it be to
switch it off?
> My main preference would be to have the Scorer
> capable of ordering and returning large numbers of results without
> blowing up --- whether it does so by default is merely a detail.
KS won't blow up, because the standard TopDocs search uses a
finite-sized HitQueue to order results on the fly as scoring proceeds rather
than accumulating a giant array of hits and sorting by score at the end.
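The general technique can be sketched with a bounded min-heap. This is a Python illustration of the idea, not KinoSearch's actual HitQueue; the function name top_hits and its parameters are assumptions made for the example.

```python
import heapq


def top_hits(scored_docs, max_hits):
    """Return the max_hits best (doc_id, score) pairs, best first.

    scored_docs may be any iterable of (doc_id, score) pairs; memory use
    is bounded by max_hits no matter how many documents are scored.
    """
    if max_hits <= 0:
        return []
    queue = []  # min-heap of (score, doc_id); weakest retained hit at queue[0]
    for doc_id, score in scored_docs:
        if len(queue) < max_hits:
            heapq.heappush(queue, (score, doc_id))
        elif score > queue[0][0]:
            # New hit beats the weakest retained hit, so swap it in.
            heapq.heapreplace(queue, (score, doc_id))
    return [(doc_id, score) for score, doc_id in sorted(queue, reverse=True)]
```

Even if a query matches every document in the index, the queue never holds more than max_hits entries, so the cost of a huge result set is scoring time rather than memory.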
> So yes, implementing a NOTQuery that the default parser optimizes out
> would be just fine for my purposes, although I might try to argue that
> this optimization should take place at some later stage to allow for a
> simpler Parser.
I don't think we can move it outside of QueryParser. We'd have to
adapt every search method individually, which wouldn't be practical or
wise.
Marvin Humphrey
Rectangular Research
http://www.rectangular.com/