[KinoSearch] opening up the scorers
Marvin Humphrey
marvin at rectangular.com
Thu Apr 17 15:10:25 PDT 2008
On Apr 17, 2008, at 10:43 AM, Nathan Kurz wrote:
>> I've been thinking about adding new public classes ORQuery, ANDQuery,
>> ANDNOTQuery and ANDORQuery. BooleanQuery would either be deprecated or
>> removed; the logic from the compilation phase of BooleanScorer's first
>> iteration would be moved to QueryParser.
>
> This sounds like a good idea to me, especially changing QueryParser to
> build the query directly from the components.
Groovy.
> I think it would be
> great to have a toolbox of component scorers (core or KSx) that can be
> wired together in different ways by custom QueryParsers.
Core is going to need these combining Query/Scorer subclasses.
People could potentially publish KSx subclasses that compile down to
scorers that behave differently from those in core.
> I still can't keep straight how Queries and Scorers relate.
Query is the abstract specification. It's little more than a parse
tree for a search string[1][2].
Scorer does the hard work of actually scoring documents. It's the
practical application of a Query, where Query meets the real world.
Query objects are not tied to any given collection of documents -- you
can apply a Query to different indexes, just as you can search for
"foo AND bar" at either Google or Yahoo.
Scorers, on the other hand, operate against specific indexes.
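The Query/Scorer split described above can be sketched roughly as follows. This is a Python illustration only, not KinoSearch's API: the class names, the make_scorer method, and the dict-based toy index are all hypothetical, and raw term frequency stands in for a real scoring formula.

```python
# Hypothetical illustration only; none of these names are KinoSearch's real API.

class TermQuery:
    """Abstract specification: knows the term, nothing about any index."""

    def __init__(self, term):
        self.term = term

    def make_scorer(self, index):
        # Binding to a concrete index happens here, not at construction time.
        return TermScorer(index.get(self.term, {}))


class TermScorer:
    """Does the actual work against one specific index's postings."""

    def __init__(self, postings):
        self.postings = postings  # {doc_id: term frequency}

    def score_all(self):
        # Toy scoring: raw term frequency stands in for a real similarity.
        return dict(self.postings)


# Two independent toy indexes, each mapping term -> {doc_id: frequency}.
index_a = {"foo": {1: 2, 4: 1}}
index_b = {"foo": {7: 5}}

query = TermQuery("foo")                            # one Query ...
scores_a = query.make_scorer(index_a).score_all()   # ... compiled per index
scores_b = query.make_scorer(index_b).score_all()
```

The same query object is reused unchanged; only the scorers differ, because each one is tied to the index it was built against.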
> AndQuery: short circuit and, scored in some way as a product of
> subqueries?
> OrQuery: score equal to best scoring subquery, could be short
> circuit if sorted?
> AndOrQuery: score all subqueries and add them, possibly normalized?
> AndNotQuery: not sure why this isn't a NotQuery, scored as a
> constant?
ANDQuery - Search for 'a AND b'.
ORQuery - Search for 'a OR b'.
ANDNOTQuery - Search for 'a AND NOT b'.
ANDORQuery is the odd one out, because it doesn't really mean 'a AND/OR
b'. What it does is combine one optional clause and one required
clause.
ANDORQuery - Search for 'a +b' (a is optional, b is required).
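To make the four combinations concrete, here is a small Python sketch of which documents each one matches. The function names and the doc_id -> score dicts are hypothetical. Summing the required and optional scores in the ANDOR case is an assumption, taken from what the Lucene name ReqOptSumScorer suggests rather than from anything KinoSearch specifies.

```python
# Hypothetical helpers; each clause is a dict of doc_id -> score.

def and_docs(a, b):
    """'a AND b': documents matching both clauses."""
    return sorted(a.keys() & b.keys())


def or_docs(a, b):
    """'a OR b': documents matching either clause."""
    return sorted(a.keys() | b.keys())


def andnot_docs(a, b):
    """'a AND NOT b': documents matching a but not b."""
    return sorted(a.keys() - b.keys())


def andor_scores(optional, required):
    """'a +b': only the required clause decides what matches.

    The optional clause can only raise the score of a document that
    already matched (assumed here to be a simple sum).
    """
    return {doc: score + optional.get(doc, 0.0)
            for doc, score in required.items()}


a = {1: 0.5, 2: 0.25, 3: 1.0}
b = {2: 0.5, 3: 0.25, 5: 0.75}

both = and_docs(a, b)            # [2, 3]
either = or_docs(a, b)           # [1, 2, 3, 5]
a_only = andnot_docs(a, b)       # [1]
req_opt = andor_scores(a, b)     # {2: 0.75, 3: 1.25, 5: 0.75}
```

Document 1 matches a but is absent from the 'a +b' result, and document 5 is present even though it never matched a. That is why 'AND/OR' is a misleading reading of the name.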
I chose those names because they seemed clearer than the Lucene
equivalents. Here's the mapping of Scorer subclasses:
KS              Lucene
=======================================================
ANDScorer       ConjunctionScorer
ORScorer        DisjunctionSumScorer
ANDNOTScorer    ReqExclScorer
ANDORScorer     ReqOptSumScorer
"ConjunctionScorer" I thought was a particularly poor name.
Grammatically speaking both 'OR' and 'AND' are "conjunctions", but the
"Conjunction" in "ConjunctionScorer" means logical conjunction (AND
only), not *that* kind of conjunction -- which is really confusing.
> I agree that it probably can't be the default OrQuery/OrScorer, but it
> strikes me as a useful piece of rope to tempt users who are creating
> their own queries. It also might be useful to think about how Queries
> could be split across cores/servers. If it worked, there would be
> some performance benefits of doing so per term rather than
> partitioning the corpus.
Sure, there are definitely performance benefits to term-at-a-time - it
just doesn't scale well when you need to apply boolean constraints.
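One common way to explain that trade-off, sketched in Python with hypothetical function names and toy sorted posting lists (unique doc ids assumed): term-at-a-time has to remember every candidate document until the last term has been processed, while document-at-a-time can reject a document as soon as one required list skips past it.

```python
def term_at_a_time_and(postings_lists):
    """Process each term's whole list in turn, counting hits per document.

    Returns (matches, accumulator_size); the accumulator holds every
    document seen under any term, not just the eventual matches.
    """
    hits = {}
    for postings in postings_lists:
        for doc_id in postings:
            hits[doc_id] = hits.get(doc_id, 0) + 1
    required = len(postings_lists)
    matches = sorted(doc for doc, n in hits.items() if n == required)
    return matches, len(hits)


def doc_at_a_time_and(postings_lists):
    """Walk all sorted lists in lockstep; state is one position per list."""
    if not postings_lists:
        return []
    positions = [0] * len(postings_lists)
    matches = []
    while True:
        # An AND can produce nothing more once any list is exhausted.
        if any(pos >= len(plist)
               for plist, pos in zip(postings_lists, positions)):
            return matches
        current = [plist[pos]
                   for plist, pos in zip(postings_lists, positions)]
        target = max(current)
        if all(doc == target for doc in current):
            matches.append(target)
            positions = [pos + 1 for pos in positions]
        else:
            # Skip every list forward until it reaches or passes the target.
            for i, plist in enumerate(postings_lists):
                while positions[i] < len(plist) and plist[positions[i]] < target:
                    positions[i] += 1


lists = [[1, 3, 5, 7, 9], [3, 4, 7, 10], [2, 3, 7, 8]]
taat_matches, accumulator_size = term_at_a_time_and(lists)
daat_matches = doc_at_a_time_and(lists)
```

Both approaches find the same two documents here, but the term-at-a-time accumulator grew to nine entries along the way, and that bookkeeping grows with the posting lists rather than with the result set.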
Marvin Humphrey
Rectangular Research
http://www.rectangular.com/
[1] Query would be more accurately described as akin to an "abstract
syntax tree" for a search string rather than a "parse tree".
<http://en.wikipedia.org/wiki/Abstract_syntax_tree>
[2] It's possible to use Query objects to build query specifications
that are difficult or nearly impossible to type into a search
box.