[KinoSearch] Re: Wildcards

Marvin Humphrey marvin at rectangular.com
Wed Feb 13 21:03:03 PST 2008


On Feb 13, 2008, at 8:26 PM, Nathan Kurz wrote:

> Father C (and lurkers), I think it would be great if you could write
> up your overview as well.  Even if you haven't poked around all the
> innards in depth, you're much closer to the way it works than most
> users will ever be.  So without reference to how it actually works,
> write up something describing how it should work.

Quickly...

I'm pretty close to a coherent API for Weight.

I just refactored so that boosts are dealt with only at construction
time.  They now propagate from Query to Weight in a very
straightforward way: simple Query types (TermQuery, PhraseQuery) just
copy the value, while compound query types (BooleanQuery) multiply
their own boost into the boost of each sub-query.  Here's a snip from
BooleanQuery.pm:

     # iterate over the clauses, creating a Weight for each one
     my $boost = $self->get_boost;
     my @sub_weights;
     for my $clause ( @{ $self->get_parent->get_clauses->to_perl } ) {
         my $sub_query  = $clause->get_query;
         my $sub_boost  = $boost * $sub_query->get_boost;
         my $sub_weight = $sub_query->make_weight(
             searchable => $searchable,
             boost      => $sub_boost,
         );
         push @sub_weights, $sub_weight;
     }
     $sub_weights{$$self} = \@sub_weights;
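
For anyone who wants to poke at the propagation rule in isolation, here
is a minimal, self-contained sketch.  ToyTermQuery and ToyBooleanQuery
are hypothetical stand-ins, not the real KinoSearch classes, and the
Weights are plain hashes; the only thing it is meant to show is "simple
queries copy the boost, compound queries multiply it down".

```perl
use strict;
use warnings;

# Toy stand-ins for illustration only -- NOT the real KinoSearch classes.
package ToyTermQuery;

sub new {
    my ( $class, %args ) = @_;
    my $boost = defined $args{boost} ? $args{boost} : 1;
    return bless { term => $args{term}, boost => $boost }, $class;
}
sub get_boost { $_[0]{boost} }

# Simple query type: the Weight just copies the boost it is handed.
sub make_weight {
    my ( $self, %args ) = @_;
    return { term => $self->{term}, boost => $args{boost} };
}

package ToyBooleanQuery;

sub new {
    my ( $class, %args ) = @_;
    my $boost = defined $args{boost} ? $args{boost} : 1;
    return bless { clauses => $args{clauses}, boost => $boost }, $class;
}
sub get_boost { $_[0]{boost} }

# Compound query type: multiply the incoming boost into each
# sub-query's own boost before creating the sub-Weight.
sub make_weight {
    my ( $self, %args ) = @_;
    my @sub_weights;
    for my $sub_query ( @{ $self->{clauses} } ) {
        my $sub_boost = $args{boost} * $sub_query->get_boost;
        push @sub_weights, $sub_query->make_weight( boost => $sub_boost );
    }
    return { boost => $args{boost}, sub_weights => \@sub_weights };
}

package main;

my $query = ToyBooleanQuery->new(
    boost   => 2,
    clauses => [
        ToyTermQuery->new( term => 'foo', boost => 3 ),
        ToyTermQuery->new( term => 'bar' ),    # boost defaults to 1
    ],
);
my $weight = $query->make_weight( boost => $query->get_boost );

# Prints "foo: 6" then "bar: 2".
printf "%s: %s\n", $_->{term}, $_->{boost} for @{ $weight->{sub_weights} };
```

Because the multiplication happens once, at construction time, nothing
downstream has to know about the parent's boost.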

What's left to refactor is to divide the remaining methods into two  
tasks: calculate a raw value, and normalize.

In the end, we'll have something like this:

    sub get_value {
        my $self = shift;
        my $value = $self->get_raw_value;
        $value *= $self->get_boost;
        $value *= $self->get_norm_factor;
        return $value;
    }

Methods like sum_of_squared_weights, etc., have esoteric meanings
related to cosine similarity measures and other IR theory.  It might
be kind of hard to write them up if you aren't up to speed on the
relevant topics.  Also, the architecture inherited from Lucene was a
spaghettified mess -- the code is hard to follow.  While it would be
cool to see writeups, *I* have a hard time with this part of the code
base -- a lot was cargo-culted, then verified only by comparing KS
scores against Lucene scores.
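
To make the "raw value, then normalize" split a little more concrete,
here is a rough sketch.  ToyWeight and its normalize() method are
hypothetical, the numbers are made up, and the normalization shown
(1 over the square root of the summed squares of raw value times
boost) is the Lucene-style query norm as I understand it; whether the
revised Weight ends up working exactly this way is an assumption.

```perl
use strict;
use warnings;

package ToyWeight;    # hypothetical class, for illustration only

sub new {
    my ( $class, %args ) = @_;
    return bless {
        raw_value   => $args{raw_value},
        boost       => $args{boost},
        norm_factor => 1,                  # replaced later by normalize()
    }, $class;
}

sub get_raw_value   { $_[0]{raw_value} }
sub get_boost       { $_[0]{boost} }
sub get_norm_factor { $_[0]{norm_factor} }
sub normalize       { $_[0]{norm_factor} = $_[1] }

# Same shape as the get_value sketched earlier in this message.
sub get_value {
    my $self  = shift;
    my $value = $self->get_raw_value;
    $value *= $self->get_boost;
    $value *= $self->get_norm_factor;
    return $value;
}

package main;

my @weights = (
    ToyWeight->new( raw_value => 3, boost => 1 ),
    ToyWeight->new( raw_value => 2, boost => 2 ),
);

# Task 1: the raw (boosted) values feed a sum of squares.
my $sum_of_squares = 0;
$sum_of_squares += ( $_->get_raw_value * $_->get_boost )**2 for @weights;

# Task 2: normalize, so the boosted values form a unit-length vector.
my $norm_factor = 1 / sqrt($sum_of_squares);
$_->normalize($norm_factor) for @weights;

# Prints "0.6" then "0.8".
printf "%.1f\n", $_->get_value for @weights;
```

The squares of the final values sum to 1, which is the point of the
cosine-style normalization: scores from different queries become
roughly comparable.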

Lemme finish revising Weight *before* anybody writes it up.  Then I  
look forward to making a second leap forward after some feedback.

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/



