[KinoSearch] opening up the scorers

Marvin Humphrey marvin at rectangular.com
Thu Apr 17 15:10:25 PDT 2008

On Apr 17, 2008, at 10:43 AM, Nathan Kurz wrote:
>> I've been thinking about adding new public classes ORQuery, ANDQuery,
>> ANDNOTQuery and ANDORQuery.  BooleanQuery would either be deprecated or
>> removed; the logic from the compilation phase of BooleanScorer's first
>> iteration would be moved to QueryParser.
>
> This sounds like a good idea to me, especially changing QueryParser to
> build the query directly from the components.

Groovy.

> I think it would be
> great to have a toolbox of component scorers (core or KSx) that can be
> wired together in different ways by custom QueryParsers.

Core is going to need these combining Query/Scorer subclasses.

People could potentially publish KSx subclasses that compile down to  
scorers that behave differently from those in core.

> I still can't keep straight how Queries and Scorers relate.

Query is the abstract specification.  It's little more than a parse  
tree for a search string[1][2].

Scorer does the hard work of actually scoring documents.  It's the  
practical application of a Query, where Query meets the real world.

Query objects are not tied to any given collection of documents -- you  
can apply a Query to different indexes, just as you can search for  
"foo AND bar" at either Google or Yahoo.

Scorers, on the other hand, operate against specific indexes.
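
As a rough illustration of that split, here is a sketch in Python rather
than KinoSearch's own Perl/C.  The make_scorer() method, the matches()
method, and the dict-based toy index are all hypothetical, made up for
this example; they are not the KinoSearch API.  The point is only that
one Query object can be compiled against two different indexes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TermQuery:
    """Abstract specification: 'documents containing this term'."""
    term: str

    def make_scorer(self, index):
        # Hypothetical compile step: bind the query to one specific index.
        return TermScorer(index.get(self.term, []))


@dataclass(frozen=True)
class ANDQuery:
    """Abstract specification: 'left AND right'.  Knows nothing about indexes."""
    left: object
    right: object

    def make_scorer(self, index):
        return ANDScorer(self.left.make_scorer(index),
                         self.right.make_scorer(index))


class TermScorer:
    """Tied to one index: holds that index's doc ids for a single term."""
    def __init__(self, doc_ids):
        self.doc_ids = doc_ids

    def matches(self):
        return list(self.doc_ids)


class ANDScorer:
    """Tied to one index: keeps only docs matched by both child scorers."""
    def __init__(self, a, b):
        self.a, self.b = a, b

    def matches(self):
        right = set(self.b.matches())
        return [doc for doc in self.a.matches() if doc in right]


# Toy indexes (assumed shape): term -> sorted list of doc ids.
index_one = {"foo": [1, 2, 5], "bar": [2, 5, 9]}
index_two = {"foo": [3], "bar": [4]}

# One Query, applied to two different indexes.
query = ANDQuery(TermQuery("foo"), TermQuery("bar"))
hits_one = query.make_scorer(index_one).matches()
hits_two = query.make_scorer(index_two).matches()
```

The Query tree is built once and never changes; only the Scorers hold
per-index state.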

> AndQuery: short circuit and, scored in some way as a product of subqueries?
> OrQuery: score equal to best scoring subquery, could be short circuit if sorted?
> AndOrQuery: score all subqueries and add them, possibly normalized?
> AndNotQuery: not sure why this isn't a NotQuery, scored as a constant?

   ANDQuery    - Search for 'a AND b'.
   ORQuery     - Search for 'a OR b'.
   ANDNOTQuery - Search for 'a AND NOT b'.

ANDORQuery is the odd one out, because it doesn't really mean
'a AND/OR b'.  What it does is combine one optional clause and one
required clause.

   ANDORQuery  - Search for 'a +b' (a is optional, b is required).
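
To make the four behaviors concrete, here is a toy sketch in Python.
Each clause is represented as a dict mapping doc id to score.  The
function names are hypothetical, and summing the scores of matching
clauses is an assumption made for the example, not a statement of how
KS scores these queries.

```python
def and_match(a, b):
    """'a AND b': only docs matched by both clauses (scores summed, assumed)."""
    return {doc: a[doc] + b[doc] for doc in a if doc in b}


def or_match(a, b):
    """'a OR b': docs matched by either clause (scores summed, assumed)."""
    merged = dict(a)
    for doc, score in b.items():
        merged[doc] = merged.get(doc, 0.0) + score
    return merged


def andnot_match(a, b):
    """'a AND NOT b': docs matched by a, minus anything matched by b."""
    return {doc: score for doc, score in a.items() if doc not in b}


def andor_match(optional, required):
    """'a +b': the required clause decides which docs match;
    the optional clause can only add to a matching doc's score."""
    return {doc: score + optional.get(doc, 0.0)
            for doc, score in required.items()}


# Toy clause results: doc id -> score.
a = {1: 1.0, 2: 0.5}
b = {2: 2.0, 3: 1.0}

and_hits = and_match(a, b)
or_hits = or_match(a, b)
andnot_hits = andnot_match(a, b)
andor_hits = andor_match(a, b)
```

With these inputs, the ANDOR result contains exactly the docs that b
matches (2 and 3); doc 2 simply scores higher because a matched it too.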

I chose those names because they seemed clearer than the Lucene  
equivalents.  Here's the mapping of Scorer subclasses:

   KS                            Lucene
   =======================================================
   ANDScorer                     ConjunctionScorer
   ORScorer                      DisjunctionSumScorer
   ANDNOTScorer                  ReqExclScorer
   ANDORScorer                   ReqOptSumScorer

I thought "ConjunctionScorer" was a particularly poor name.
Grammatically speaking, both 'OR' and 'AND' are "conjunctions", but the
"Conjunction" in "ConjunctionScorer" doesn't refer to *that* kind of
conjunction.  It means logical conjunction, i.e. AND only, which makes
the name really confusing.

> I agree that it probably can't be the default OrQuery/OrScorer, but it
> strikes me as a useful piece of rope to tempt users who are creating
> their own queries.  It also might be useful to think about how Queries
> could be split across cores/servers.  If it worked, there would be
> some performance benefits of doing so per term rather than
> partitioning the corpus.

Sure, there are definitely performance benefits to term-at-a-time
processing.  It just doesn't scale well when you need to apply boolean
constraints.
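
One reading of that scaling problem, sketched in Python.  The function
name, the postings layout, and the require_all flag are hypothetical,
and the sketch assumes distinct terms with each doc appearing at most
once per posting list.  With term-at-a-time processing, an AND
constraint can only be checked after every posting list has been read,
so per-doc state has to be kept for every doc that matched any term.

```python
from collections import defaultdict


def term_at_a_time(postings, terms, require_all=False):
    """Read one whole posting list per term, accumulating per-doc state.

    Returns (hits, accumulator_count), where accumulator_count is the
    number of docs for which state had to be kept.
    """
    scores = defaultdict(float)
    seen = defaultdict(int)  # how many of the terms each doc has matched
    for term in terms:
        for doc, weight in postings.get(term, []):
            scores[doc] += weight
            seen[doc] += 1
    if require_all:
        # The boolean constraint is applied only at the very end.
        hits = {doc: s for doc, s in scores.items()
                if seen[doc] == len(terms)}
        return hits, len(scores)
    return dict(scores), len(scores)


# Toy postings (assumed shape): term -> list of (doc id, weight).
postings = {
    "foo": [(1, 1.0), (2, 1.0), (5, 1.0)],
    "bar": [(2, 2.0), (9, 2.0)],
}

hits, accumulators = term_at_a_time(postings, ["foo", "bar"],
                                    require_all=True)
```

Here only one doc satisfies 'foo AND bar', but four accumulators were
needed to find that out.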

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/

[1] Strictly speaking, a Query is closer to an "abstract syntax tree"
     for a search string than to a "parse tree".
     <http://en.wikipedia.org/wiki/Abstract_syntax_tree>

[2] It's possible to use Query objects to build query specifications
     that are difficult or nearly impossible to type into a search
     box.
