[KinoSearch] opening up the scorers

Marvin Humphrey marvin at rectangular.com
Fri Apr 18 18:50:45 PDT 2008




On Apr 17, 2008, at 10:15 PM, Nathan Kurz wrote:
> So the tree of Queries is used to build a tree (typically) of Scorers,
> and each Query class has a one-to-one relationship with a Scorer
> class?

It's close to a one-to-one relationship, but not quite.  Some  
optimizations are possible when compiling the Scorers.

For instance, if someone has created a PhraseQuery that only has one  
term in it, you know you can compile that down to a TermScorer instead  
of PhraseScorer.  Or, even better, say you have a simple TermQuery,  
and you find out that the term isn't in the index (because  
$searchable->doc_freq returns 0).  Then you can just return undef  
(indicating a null result set) instead of a Scorer.

> Is there any 'query' specific code in the query beyond the
> name of the Scorer class?

There is actually quite a lot that happens in between a Query and a  
Scorer.  That's where the "Weight" classes come in - they encapsulate  
the process of compiling a Query to a Scorer.

Query classes are, indeed, very simple.  There's not much to them  
except for a make_weight() factory method (and an extract_terms()  
method I'd really like to kill off).

> My desire for simplicity makes me wonder if
> one could just have a single 'QueryNode' class that instantiates a
> customizeable Scorer.

I don't quite follow.

>> People could potentially publish KSx subclasses that compile down to
>> scorers that behave differently from those in core.
>
> For a custom OrScorer that I'm interested in (short-circuit OR,
> returns the score of the first match of the ordered children) what
> would I subclass and how would I call it?

Your ORQuery subclass would probably look like this:

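   package FirstMatchORQuery;
   use strict;
   use warnings;
   use base qw( KinoSearch::Search::ORQuery );

   # Factory method: hand back our own Weight, which in turn compiles
   # the query down to the custom Scorer.
   sub make_weight {
      my $self = shift;
      return FirstMatchORWeight->new( @_, parent => $self );
   }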

   package FirstMatchORWeight;
   ...
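
As for the scoring rule itself, here is a minimal sketch of "score of the first match of the ordered children", using plain hashes of doc id => score in place of real child Scorers.  The function name and data layout are made up for illustration; a real Scorer would iterate rather than look up.

```perl
use strict;
use warnings;

# Score one document against an ordered list of child clauses.  Each child
# is a hash of doc id => score (toy data).  Returns the score from the
# first child that matches, or undef when none do.
sub first_match_score {
    my ( $doc, @children ) = @_;
    for my $child (@children) {
        return $child->{$doc} if exists $child->{$doc};
    }
    return;
}

my @children = ( { 1 => 3.0 }, { 1 => 9.0, 2 => 1.5 } );
my $score_1  = first_match_score( 1, @children );    # first child wins
my $score_2  = first_match_score( 2, @children );    # falls through to second
my $score_3  = first_match_score( 3, @children );    # no child matches
```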

> My instinct is it would be
> simplest just to build the Scorer tree myself and stick my
> FirstMatchScorer in at the appropriate places.   But what would the
> right way be?

You mean how would you persuade QueryParser to use your ORQuery  
variant rather than the default?  Probably we'd need to give  
QueryParser some sort of make_orquery() factory method you could  
override.
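
Roughly what such a hook might look like, using a toy parser rather than the real QueryParser (which has no make_orquery() method today; ToyQueryParser, FirstMatchQueryParser, and the hash-based "queries" are all assumptions for the sake of the sketch):

```perl
use strict;
use warnings;

# Toy parser: splits on the word OR and hands the pieces to a factory
# method.  Hypothetical, not the KinoSearch QueryParser.
package ToyQueryParser;
sub new { my $class = shift; return bless {}, $class }

# Default factory: build the stock OR variant.
sub make_orquery {
    my ( $self, @children ) = @_;
    return { class => 'ORQuery', children => [@children] };
}

sub parse {
    my ( $self, $string ) = @_;
    my @terms = split /\s+OR\s+/, $string;
    return $self->make_orquery(@terms);
}

# Subclass that swaps in a different OR variant by overriding the factory.
package FirstMatchQueryParser;
our @ISA = ('ToyQueryParser');

sub make_orquery {
    my ( $self, @children ) = @_;
    return { class => 'FirstMatchORQuery', children => [@children] };
}

package main;

my $default = ToyQueryParser->new->parse('a OR b');
my $custom  = FirstMatchQueryParser->new->parse('a OR b');
```

The parsing logic stays in one place; only the factory method differs between the two parsers.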

I'm not sure I want that to happen right away in core, though.   
QueryParser-type classes are sadly prone to death by Featuritis.  This  
is the kind of thing I'd rather see refined via KSx.

>>  ANDQuery    - Search for 'a AND b'.
>>  ORQuery     - Search for 'a OR b'.
>>  ANDNOTQuery - Search for 'a AND NOT b'.
>
> Why not just have a NotQuery?

Good question, and, I think, a good suggestion.

When we swap out ANDNOTQuery for NOTQuery, all of a sudden we get a  
coherent suite:

   ANDQuery
   ORQuery
   NOTQuery
   ReqOptQuery

Background:

NOTQuery hasn't been needed up till now.  QueryParser doesn't parse  
'NOT brobniquitz' down to a NOTQuery because it's standard behavior  
for search engines to parse that kind of thing as a void query with no  
result set rather than return the universe.

> It seems like it would be more general,
> and one could always build the 'a AND NOT b' using an AND and a NOT.

I think this is probably a good plan.  I played back a couple  
scenarios in my mind to see whether the combination of an ANDScorer  
and a NOTScorer would needlessly iterate over more results than an  
ANDNOTScorer would, but with Scorer_Skip_To, I couldn't come up with a  
case where that would happen.

There's going to be a marginal increase in CPU overhead from wrapping  
a positive scorer with a NOTScorer, but I doubt it will matter.
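
To make the composition concrete, here is a small sketch of an AND over a positive scorer and a NOT wrapper, cooperating through a skip_to method.  The classes are toys operating on sorted lists of doc ids (ListScorer, the max_doc argument, and the collect method are assumptions for illustration, not the C-level Scorer API), and the sketch makes no claim about relative iteration cost.

```perl
use strict;
use warnings;

# Iterates over a sorted list of doc ids.  Toy stand-in for a TermScorer.
package ListScorer;
sub new { my ( $class, @docs ) = @_; return bless { docs => [@docs], pos => 0 }, $class }

# Return the first doc id >= $target, or undef when exhausted.
sub skip_to {
    my ( $self, $target ) = @_;
    my $docs = $self->{docs};
    $self->{pos}++ while $self->{pos} < @$docs && $docs->[ $self->{pos} ] < $target;
    return $self->{pos} < @$docs ? $docs->[ $self->{pos} ] : undef;
}

# Matches every doc id in 0 .. max_doc - 1 that the wrapped scorer does not.
package NOTScorer;
sub new {
    my ( $class, %args ) = @_;
    return bless { negated => $args{negated}, max_doc => $args{max_doc} }, $class;
}

sub skip_to {
    my ( $self, $target ) = @_;
    while ( $target < $self->{max_doc} ) {
        my $hit = $self->{negated}->skip_to($target);
        return $target if !defined($hit) || $hit != $target;
        $target++;    # $target is excluded; try the next doc id
    }
    return;
}

# Intersects two scorers by leapfrogging with skip_to.
package ANDScorer;
sub new { my ( $class, $first, $second ) = @_; return bless { first => $first, second => $second }, $class }

sub collect {
    my $self = shift;
    my @hits;
    my $target = 0;
    while (1) {
        my $doc_1 = $self->{first}->skip_to($target);
        last unless defined $doc_1;
        my $doc_2 = $self->{second}->skip_to($doc_1);
        last unless defined $doc_2;
        if ( $doc_1 == $doc_2 ) {
            push @hits, $doc_1;
            $target = $doc_1 + 1;
        }
        else {
            $target = $doc_2;    # jump ahead to the other side's candidate
        }
    }
    return @hits;
}

package main;

my $positive = ListScorer->new( 1, 3, 5, 8 );
my $excluded = ListScorer->new( 3, 4, 8 );
my $not      = NOTScorer->new( negated => $excluded, max_doc => 10 );
my @hits     = ANDScorer->new( $positive, $not )->collect;
print "@hits\n";    # prints "1 5"
```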

>> ANDORQuery is the odd one out, because it doesn't really mean 'a  
>> AND/OR b'.
>> What it does is combine one optional clause and one required clause.
>
> Ditto.  Why not just layer an AND and an OR?

I don't think that's quite the same thing??

> Or an AND with a
> hypothetical 'OptionalTermScorer' that returns some non-zero score if
> the term is not found?

If I follow what you're saying, I think that would sort of work, but  
it's no clearer conceptually than a ReqOptQuery combining one required  
clause and one optional clause.
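
For comparison, a toy illustration of how the match sets differ between a plain AND, a plain OR, and a required/optional pair.  Clauses are hashes of doc id => score and scores are simply summed; the function names and data are invented for the example and say nothing about how the real Scorers are implemented.

```perl
use strict;
use warnings;

# Each clause is a hash of doc id => score for the docs it matches.
# Toy data, not KinoSearch output.
my %required = ( 1 => 0.5, 2 => 0.25 );
my %optional = ( 2 => 1.0, 3 => 2.0 );

# AND: both clauses must match.
sub and_scores {
    my ( $x, $y ) = @_;
    my %scores;
    for my $doc ( keys %$x ) {
        $scores{$doc} = $x->{$doc} + $y->{$doc} if exists $y->{$doc};
    }
    return \%scores;
}

# OR: either clause may match.
sub or_scores {
    my ( $x, $y ) = @_;
    my %scores = %$x;
    for my $doc ( keys %$y ) {
        $scores{$doc} = ( $scores{$doc} || 0 ) + $y->{$doc};
    }
    return \%scores;
}

# ReqOpt: the required clause decides membership; the optional clause
# only adds to the score of docs that already matched.
sub req_opt_scores {
    my ( $req, $opt ) = @_;
    my %scores;
    for my $doc ( keys %$req ) {
        $scores{$doc} = $req->{$doc} + ( $opt->{$doc} || 0 );
    }
    return \%scores;
}

my $and     = and_scores( \%required, \%optional );        # doc 2 only
my $or      = or_scores( \%required, \%optional );         # docs 1, 2, 3
my $req_opt = req_opt_scores( \%required, \%optional );    # docs 1, 2
```

Doc 1 matches under ReqOpt but not AND; doc 3 matches under OR but not ReqOpt; doc 2 scores 1.25 in all three.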

> I do like that the Lucene names mention
> that they are 'Sum' scorers, though, as it seems useful to distinguish
> how the actual scoring is done.

Right.  FYI, there's also a DisjunctionMaxScorer, which is mated with  
a DisjunctionMaxQuery.

> ps. The ice cream goes pretty well: http://screamsorbet.com/

Beet Lemon Sorbet!  Awesome.

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/

