[KinoSearch] Re: Wildcards
Marvin Humphrey
marvin at rectangular.com
Wed Feb 13 21:03:03 PST 2008
On Feb 13, 2008, at 8:26 PM, Nathan Kurz wrote:
> Father C (and lurkers), I think it would be great if you could write
> up your overview as well. Even if you haven't poked around all the
> innards in depth, you're much closer to the way it works than most
> users will ever be. So without reference to how it actually works,
> write up something describing how it should work.
Quickly...
I'm pretty close to a coherent API for Weight.
I just refactored so that boosts are dealt with only at construction
time. They now propagate from Query to Weight in a very
straightforward way: simple Query types (TermQuery, PhraseQuery) just
copy the value, while compound query types (BooleanQuery) multiply
their own boost into the boost of each sub-query. Here's a snip from
BooleanQuery.pm:
    # iterate over the clauses, creating a Weight for each one
    my $boost = $self->get_boost;
    my @sub_weights;
    for my $clause ( @{ $self->get_parent->get_clauses->to_perl } ) {
        my $sub_query  = $clause->get_query;
        my $sub_boost  = $boost * $sub_query->get_boost;
        my $sub_weight = $sub_query->make_weight(
            searchable => $searchable,
            boost      => $sub_boost,
        );
        push @sub_weights, $sub_weight;
    }
    $sub_weights{$$self} = \@sub_weights;
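To make the propagation rule concrete, here is a minimal standalone sketch. It does not use the KinoSearch classes: queries are plain hashes, and make_weight() here is a hypothetical toy function, not the real method. A simple query just copies the boost it is handed; a compound query multiplies that boost into each sub-query's boost before recursing.

```perl
use strict;
use warnings;

# Toy stand-in for Query->make_weight (hypothetical, not the KinoSearch API).
# A "query" is a hash with a boost and, for compound queries, a clause list.
sub make_weight {
    my ( $query, $boost ) = @_;
    my %weight = ( boost => $boost );    # simple case: copy the value
    if ( my $clauses = $query->{clauses} ) {
        # compound case: multiply our boost into each sub-query's boost
        $weight{sub_weights}
            = [ map { make_weight( $_, $boost * $_->{boost} ) } @$clauses ];
    }
    return \%weight;
}

my $query = {
    boost   => 2,
    clauses => [ { boost => 3 }, { boost => 0.5 } ],
};
my $weight = make_weight( $query, $query->{boost} );

# prints "6, 1"
print join( ", ", map { $_->{boost} } @{ $weight->{sub_weights} } ), "\n";
```

Since the multiplication happens once, at construction time, nothing downstream has to walk back up the query tree to find out what an enclosing boost was.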
What's left to refactor is to divide the remaining methods into two
tasks: calculate a raw value, and normalize.
In the end, we'll have something like this:
    sub get_value {
        my $self  = shift;
        my $value = $self->get_raw_value;
        $value *= $self->get_boost;
        $value *= $self->get_norm_factor;
        return $value;
    }
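A rough sketch of how that split might look from a subclass's point of view, assuming the base class owns get_value and a subclass only supplies the raw value. ToyWeight, ToyTermWeight, and the idf field are made up for illustration; they are not KinoSearch classes, and the numbers are arbitrary.

```perl
use strict;
use warnings;

package ToyWeight;    # hypothetical base class

sub new {
    my ( $class, %args ) = @_;
    return bless {%args}, $class;
}
sub get_boost       { $_[0]{boost} }
sub get_norm_factor { $_[0]{norm_factor} }

# Task 1 (raw value) is left to subclasses.
sub get_raw_value { die "subclass must implement get_raw_value" }

# Task 2 (boost and normalize) lives in one place.
sub get_value {
    my $self  = shift;
    my $value = $self->get_raw_value;
    $value *= $self->get_boost;
    $value *= $self->get_norm_factor;
    return $value;
}

package ToyTermWeight;    # hypothetical subclass
our @ISA = ('ToyWeight');

# Assumption for the sketch: the raw value is just a stored idf number.
sub get_raw_value { $_[0]{idf} }

package main;

my $weight = ToyTermWeight->new( idf => 1.5, boost => 2, norm_factor => 0.5 );
print $weight->get_value, "\n";    # prints 1.5
```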
Methods like sum_of_squared_weights, etc., have esoteric meanings
related to cosine similarity measures and other IR theory. It might
be kind of hard to write them up if you aren't up-to-speed on the
relevant topics. Also, the architecture inherited from Lucene was a
spaghettified mess -- the code is hard to follow. While it would be
cool to see writeups, *I* have a hard time with this part of the code
base -- a lot was cargo-culted then verified only by comparing KS
scores against Lucene scores.
Lemme finish revising Weight *before* anybody writes it up. Then I
look forward to making a second leap forward after some feedback.
Marvin Humphrey
Rectangular Research
http://www.rectangular.com/