[KinoSearch] more abstract interfaces to kinosearch

Marvin Humphrey marvin at rectangular.com
Tue Jul 10 01:52:58 PDT 2007


On Jul 3, 2007, at 1:06 PM, Nathan Kurz wrote:

> I'm playing blue-sky here.

I appreciate the challenges.  :)

>> The MultiSearcher object knows how common a term is across all
>> the sub-indexes *combined*.  That's information that each individual
>> IndexReader doesn't have access to.
>
> Wow, so you are making multiple round trips to the search servers?

Yes.

There's one call to each node for the doc freqs in
MultiSearcher->create_weight:

     # get an aggregated doc_freq for each term
     my @aggregated_doc_freqs = (0) x scalar @terms;
     for my $i ( 0 .. $#$searchables ) {
         my $doc_freqs = $searchables->[$i]->doc_freqs( \@terms );
         for my $j ( 0 .. $#terms ) {
             $aggregated_doc_freqs[$j] += $doc_freqs->[$j];
         }
     }

Then there's one call to each searcher for collect().

Then the content of each individual doc gets fetched from the searcher
that owns it, for however many hits are to be displayed.
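
As a rough illustration of the three round trips described above, here is
a minimal sketch using in-memory stand-ins for the remote nodes.  Only
doc_freqs() and collect() are named in this message; the MockNode class,
fetch_doc(), the hit format, and all of the data are hypothetical, and
this is not the actual MultiSearcher code.

```perl
use strict;
use warnings;

# Hypothetical in-memory stand-in for a remote search node.
package MockNode;

sub new {
    my ( $class, %args ) = @_;
    return bless {%args}, $class;
}

# Round trip 1: one integer per term.
sub doc_freqs {
    my ( $self, $terms ) = @_;
    return [ map { $self->{doc_freqs}{$_} || 0 } @$terms ];
}

# Round trip 2: [ doc_num, score ] pairs only, no doc content.
sub collect {
    my ($self) = @_;
    return [ @{ $self->{hits} } ];
}

# Round trip 3 (hypothetical name): stored content for one doc.
sub fetch_doc {
    my ( $self, $doc_num ) = @_;
    return $self->{docs}{$doc_num};
}

package main;

my @nodes = (
    MockNode->new(
        doc_freqs => { foo => 3 },
        hits      => [ [ 0, 0.9 ], [ 4, 0.2 ] ],
        docs      => { 0 => 'doc zero on node 0', 4 => 'doc four on node 0' },
    ),
    MockNode->new(
        doc_freqs => { foo => 1, bar => 2 },
        hits      => [ [ 7, 0.5 ] ],
        docs      => { 7 => 'doc seven on node 1' },
    ),
);
my @terms = qw( foo bar );

# 1. Aggregate doc freqs across all nodes.
my @aggregated = (0) x @terms;
for my $node (@nodes) {
    my $freqs = $node->doc_freqs( \@terms );
    $aggregated[$_] += $freqs->[$_] for 0 .. $#terms;
}

# 2. Gather scored hits, remembering which node owns each one.
my @hits;
for my $i ( 0 .. $#nodes ) {
    push @hits,
        map { +{ node => $i, doc => $_->[0], score => $_->[1] } }
        @{ $nodes[$i]->collect };
}
@hits = sort { $b->{score} <=> $a->{score} } @hits;

# 3. Fetch content for the top hits only, each from its owning node.
my $wanted  = 2;
my @display = map { $nodes[ $_->{node} ]->fetch_doc( $_->{doc} ) }
    @hits[ 0 .. $wanted - 1 ];
print "$_\n" for @display;
```

In this sketch the first two phases move only small integers and scores;
stored content is requested just for the hits that will be displayed.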


> I was guessing that a local approximation would suffice, so that each
> server would be independent.

That would be close some of the time and wildly inaccurate some of  
the time.
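
To make the point above concrete, here is a small sketch comparing
per-node weights against a weight computed from aggregated statistics.
The idf formula is a Lucene-style one chosen purely as an assumption for
illustration, and the node names and numbers are made up.

```perl
use strict;
use warnings;

# Hypothetical per-node statistics for one term: [ doc_freq, max_doc ].
my %nodes = (
    node_a => [ 900, 1000 ],    # term is very common on this node
    node_b => [ 1,   1000 ],    # term is very rare on this node
);

# Lucene-style idf, assumed here only for the sake of the example.
sub idf {
    my ( $doc_freq, $max_doc ) = @_;
    return 1 + log( $max_doc / ( $doc_freq + 1 ) );
}

my ( $total_df, $total_docs ) = ( 0, 0 );
for my $name ( sort keys %nodes ) {
    my ( $df, $max ) = @{ $nodes{$name} };
    $total_df   += $df;
    $total_docs += $max;
    printf "%s local idf: %.3f\n", $name, idf( $df, $max );
}
printf "global idf:       %.3f\n", idf( $total_df, $total_docs );
```

With skewed data like this, one node rates the term as nearly worthless
and the other as highly significant, so scores from the two nodes would
not be comparable unless they share the aggregated figure.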

> I guess that's the only way to get truly
> accurate weights, but I wonder if the increased precision is worth the
> cost.

Very few bits are going over the wire until doc content starts flying.

>  each set of cooperating
> scorers shares a ScorerData object, which according to the needs of
> that scoring system may or may not look like the current Similarity
> object.

You could do this by extending Similarity, which I find conceptually  
appealing.  But I can see why you'd find it cleaner to start fresh.

>> > no delayed inits,
>>
>> Any of these that can go away, I say good riddance.  They are usually
>> artifacts of a simplified external API, though.  The delayed init
>> spares the caller from the responsibility of invoking some init
>> routine manually before a loop begins.
>
> Yes, in particular it's because of the incremental formation of
> BooleanScorer.  Rather than requiring an init call, I'm thinking that
> we can just require that all the clauses be passed to the constructor
> like the subscorers currently do.

Every place I've done that I've regretted it, and in some places  
changed it -- e.g. Highlighter 0.15 vs 0.20_04.

http://www.rectangular.com/kinosearch/docs/stable/KinoSearch/Highlight/Highlighter.html
http://www.rectangular.com/kinosearch/docs/devel/KinoSearch/Highlight/Highlighter.html

Jamming everything into your constructor isn't good design.  Pretty  
soon you want to add another parameter, and the thing starts to get  
gnarly....

> Nothing major, and nothing tested.  I think there is a small gain by
> having Scorer_Advance return a doc number directly rather than a
> boolean, obviating the need for a follow-up call to Scorer_Doc.

Doc numbers begin at 0, so a return value of 0 couldn't be told apart
from "no more docs".  That's not going to work without making some
changes.  Might be worth trying, but a lot of stuff that depends on the
current behavior of IndexReader->max_doc would go haywire.
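
The ambiguity around doc number 0 can be sketched in Perl.  The
ListScorer class below is hypothetical and is not the C-level
Scorer_Advance; it only shows what happens when an advance routine
returns doc numbers and the caller tests the result as a boolean, versus
testing against an explicit "exhausted" value (undef here).

```perl
use strict;
use warnings;

# Hypothetical scorer over a fixed list of matching doc numbers.
package ListScorer;

sub new {
    my ( $class, @docs ) = @_;
    return bless { docs => [@docs], tick => -1 }, $class;
}

# Returns the next doc number, or undef when exhausted.
sub advance {
    my $self = shift;
    $self->{tick}++;
    return $self->{docs}[ $self->{tick} ];
}

package main;

# Broken: the return value is tested as a boolean, so doc 0 ends the loop.
my @seen_buggy;
my $scorer = ListScorer->new( 0, 3, 5 );
while ( my $doc = $scorer->advance ) {
    push @seen_buggy, $doc;
}

# Works: test against the explicit "no more docs" value instead.
my @seen_ok;
$scorer = ListScorer->new( 0, 3, 5 );
while ( defined( my $doc = $scorer->advance ) ) {
    push @seen_ok, $doc;
}

print "boolean test: @seen_buggy\n";
print "defined test: @seen_ok\n";
```

The first loop never sees any docs because doc 0 is false.  A version in
C would need some equivalent change, such as a reserved sentinel value or
doc numbers that start at 1.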

>   A better worst-case example would have been:
>
> (expensive-tally OR expensive-tally) AND (expensive-tally OR expensive-tally)
>
> Assume the pessimal case where the first and-clause matches only odd
> documents, and the second and-clause matches only even.  In the
> current code, we'd still be performing a lot of expensive Tally's even
> though in theory we don't need to perform any.
>
> I'm still not sure if this makes a difference, though, and whether or
> not I should try to keep Tally clearly distinct and after Advance.  So
> long as finding a match is more expensive than tallying it, it's not
> going to make much of a difference.

I think ORScorer is the only place this is going to apply.  It's a  
valid concern, but the current BooleanScorer proceeds doc-at-a-time,  
which allows it to call Scorer_Skip_To() on its subscorers.  That's  
not the case for the BooleanScorer used in KS 0.15.
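
A rough sketch of why doc-at-a-time processing with skip_to helps in the
odd/even case: an AND over two subscorers can leapfrog between them and
only tally when both land on the same doc.  Everything here is a
simplified assumption (a hypothetical ListScorer with flat doc lists in
place of real OR clauses, and Perl methods in place of Scorer_Skip_To);
it is not the KinoSearch BooleanScorer.

```perl
use strict;
use warnings;

# Hypothetical subscorer over a sorted list of doc numbers.
package ListScorer;

our $tallies = 0;

sub new {
    my ( $class, @docs ) = @_;
    return bless { docs => [@docs], tick => 0 }, $class;
}

# Current doc number, or undef when exhausted.
sub doc { $_[0]{docs}[ $_[0]{tick} ] }

# Move to the first doc >= $target; return it, or undef if none.
sub skip_to {
    my ( $self, $target ) = @_;
    $self->{tick}++
        while defined( $self->doc() ) && $self->doc() < $target;
    return $self->doc();
}

# Stand-in for an expensive scoring step; counts how often it runs.
sub tally { $tallies++; return 1.0 }

package main;

sub and_matches {
    my ( $first, $second ) = @_;
    my @matches;
    my $target = 0;
    while (1) {
        my $doc_a = $first->skip_to($target);
        last unless defined $doc_a;
        my $doc_b = $second->skip_to($doc_a);
        last unless defined $doc_b;
        if ( $doc_a == $doc_b ) {
            # Both clauses agree, so only now pay for scoring.
            $first->tally;
            $second->tally;
            push @matches, $doc_a;
            $target = $doc_a + 1;
        }
        else {
            $target = $doc_b;    # leapfrog past the gap
        }
    }
    return @matches;
}

# Pessimal case: one clause matches only odd docs, the other only even.
my @none = and_matches(
    ListScorer->new( 1, 3, 5, 7, 9 ),
    ListScorer->new( 0, 2, 4, 6, 8 ),
);
my $tallies_disjoint = $ListScorer::tallies;

# Overlapping case: tallies happen only on the shared docs.
my @some = and_matches(
    ListScorer->new( 1, 3, 5, 8 ),
    ListScorer->new( 0, 3, 4, 8 ),
);
print "disjoint tallies: $tallies_disjoint\n";
print "overlap matches:  @some\n";
```

In this simplified model the odd/even case performs no tallies at all.
Whether that carries over when each clause is itself an OR of expensive
scorers depends on when the OR scorer does its tallying.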

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/




