[KinoSearch] more abstract interfaces to kinosearch
Marvin Humphrey
marvin at rectangular.com
Tue Jul 10 01:52:58 PDT 2007
On Jul 3, 2007, at 1:06 PM, Nathan Kurz wrote:
> I'm playing blue-sky here.
I appreciate the challenges. :)
>> The MultiSearcher object knows how common a term is across all
>> the sub-indexes *combined*. That's information that each individual
>> IndexReader doesn't have access to.
>
> Wow, so you are making multiple round trips to the search servers?
Yes.
There's one call to each node for the doc freqs in
MultiSearcher->create_weight:
    # get an aggregated doc_freq for each term
    my @aggregated_doc_freqs = (0) x scalar @terms;
    for my $i ( 0 .. $#$searchables ) {
        my $doc_freqs = $searchables->[$i]->doc_freqs( \@terms );
        for my $j ( 0 .. $#terms ) {
            $aggregated_doc_freqs[$j] += $doc_freqs->[$j];
        }
    }
Then there's one call to each searcher for collect().
Then individual doc content gets fetched from the searcher that owns
each doc, for however many hits are to be displayed.
> I was guessing that a local approximation would suffice, so that each
> server would be independent.
That would be close some of the time and wildly inaccurate some of
the time.
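
To illustrate how far a local approximation can drift, here is a minimal
sketch. The per-node numbers are made up, and the idf() formula is an
assumed Lucene-style one, not necessarily what Similarity computes.

```perl
use strict;
use warnings;

# Hypothetical per-node statistics for one term: [ doc_freq, num_docs ].
my @nodes = ( [ 900, 1000 ], [ 2, 1000 ], [ 50, 1000 ] );

# Assumed Lucene-style IDF; the real Similarity may use another formula.
sub idf {
    my ( $doc_freq, $num_docs ) = @_;
    return 1 + log( $num_docs / ( 1 + $doc_freq ) );
}

# Aggregate the stats across all nodes, as MultiSearcher does for doc freqs.
my ( $total_df, $total_docs ) = ( 0, 0 );
for my $node (@nodes) {
    $total_df   += $node->[0];
    $total_docs += $node->[1];
}

my $global_idf = idf( $total_df, $total_docs );
my @local_idfs = map { idf( @$_ ) } @nodes;

printf "global idf: %.2f\n", $global_idf;
printf "node %d local idf: %.2f\n", $_, $local_idfs[$_] for 0 .. $#local_idfs;
```

A node where the term is rare weights it far more heavily than a node
where it is common, so scores from the two nodes would not be comparable
when merged.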
> I guess that's the only way to get truly
> accurate weights, but I wonder if the increased precision is worth the
> cost.
Very few bits are going over the wire until doc content starts flying.
> each set of cooperating
> scorers shares a ScorerData object, which according to the needs of
> that scoring system may or may not look like the current Similarity
> object.
You could do this by extending Similarity, which I find conceptually
appealing. But I can see why you'd find it cleaner to start fresh.
>> > no delayed inits,
>>
>> Any of these that can go away, I say good riddance. They are usually
>> artifacts of a simplified external API, though. The delayed init
>> spares the caller from the responsibility of invoking some init
>> routine manually before a loop begins.
>
> Yes, in particular it's because of the incremental formation of
> BooleanScorer. Rather than requiring an init call, I'm thinking that
> we can just require that all the clauses be passed to the constructor
> like the subscorers currently do.
Every place I've done that I've regretted it, and in some places
changed it -- e.g. Highlighter 0.15 vs 0.20_04.
http://www.rectangular.com/kinosearch/docs/stable/KinoSearch/Highlight/Highlighter.html
http://www.rectangular.com/kinosearch/docs/devel/KinoSearch/Highlight/Highlighter.html
Jamming everything into your constructor isn't good design. Pretty
soon you want to add another parameter, and the thing starts to get
gnarly....
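
A minimal sketch of the alternative, where clauses are added one at a
time after construction. ToyBooleanQuery and its parameter names are
hypothetical and only illustrate the shape of the API.

```perl
use strict;
use warnings;

# Hypothetical toy class, not the KinoSearch API.
package ToyBooleanQuery;

# The constructor takes only what every instance needs.
sub new {
    my ( $class, %args ) = @_;
    return bless { boost => $args{boost} // 1.0, clauses => [] }, $class;
}

# Clauses arrive one at a time, so a new per-clause option can be added
# later without touching the constructor's signature.
sub add_clause {
    my ( $self, %clause ) = @_;
    push @{ $self->{clauses} }, { occur => 'SHOULD', %clause };
    return $self;
}

package main;

my $query = ToyBooleanQuery->new;
$query->add_clause( term => 'foo', occur => 'MUST' );
$query->add_clause( term => 'bar' );    # falls back to the default occur
```

The cost is that the object is not complete when the constructor returns,
which is exactly where the delayed init comes from.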
> Nothing major, and nothing tested. I think there is a small gain by
> having Scorer_Advance return a doc number directly rather than a
> boolean, obviating the need for a follow-up call to Scorer_Doc.
Doc numbers begin at 0, so a returned 0 would be indistinguishable from
a false "no more docs" return. That's not going to work without making
some changes. Might be worth trying, but a lot of stuff that depends
on the current behavior of IndexReader->max_doc would go haywire.
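
The same ambiguity is easy to show in Perl. This is a sketch only:
ToyScorer is a hypothetical stand-in, and the real Scorer_Advance is a C
function which would need its own sentinel value or 1-based doc numbers.

```perl
use strict;
use warnings;

# Hypothetical scorer over a fixed list of matching doc numbers.
package ToyScorer;

sub new {
    my ( $class, @docs ) = @_;
    return bless { docs => [@docs], tick => -1 }, $class;
}

# Returns the next doc number, or undef when the list is exhausted.
sub advance {
    my $self = shift;
    $self->{tick}++;
    return $self->{docs}[ $self->{tick} ];
}

package main;

# A plain truth test stops at once, because doc number 0 is false.
my $naive = ToyScorer->new( 0, 3, 7 );
my @seen_naive;
while ( my $doc = $naive->advance ) {
    push @seen_naive, $doc;
}

# Testing for definedness keeps doc 0 and still ends on exhaustion.
my $careful = ToyScorer->new( 0, 3, 7 );
my @seen_careful;
while ( defined( my $doc = $careful->advance ) ) {
    push @seen_careful, $doc;
}
```

The first loop collects nothing and the second collects all three docs,
so every caller would have to switch to the second form.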
> A better worst-case example would have been:
>
> (expensive-tally OR expensive-tally) AND (expensive-tally OR
> expensive-tally)
>
> Assume the pessimal case where the first and-clause matches only odd
> documents, and the second and-clause matches only even. In the
> current code, we'd still be performing a lot of expensive Tally's even
> though in theory we don't need to perform any.
>
> I'm still not sure if this makes a difference, though, and whether or
> not I should try to keep Tally clearly distinct and after Advance. So
> long as finding a match is more expensive than tallying it, it's not
> going to make much of a difference.
I think ORScorer is the only place this is going to apply. It's a
valid concern, but the current BooleanScorer proceeds doc-at-a-time,
which allows it to call Scorer_Skip_To() on its subscorers. That's
not the case for the BooleanScorer used in KS 0.15.
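
A rough sketch of why doc-at-a-time helps with the odd/even case. The
ToyScorer class and and_matches() are hypothetical, and the real scorers
are more involved, but the idea is that tally only runs once both sides
agree on a doc.

```perl
use strict;
use warnings;

# Hypothetical scorer over a sorted list of doc numbers.
package ToyScorer;

sub new {
    my ( $class, @docs ) = @_;
    return bless { docs => [@docs], tick => 0, tallies => 0 }, $class;
}

# Current doc number, or undef when exhausted.
sub doc { $_[0]{docs}[ $_[0]{tick} ] }

# Move to the first doc >= $target; return false when exhausted.
sub skip_to {
    my ( $self, $target ) = @_;
    $self->{tick}++
        while defined( $self->doc ) && $self->doc < $target;
    return defined $self->doc;
}

# Stand-in for an expensive scoring step; counts how often it runs.
sub tally { $_[0]{tallies}++; return 1.0 }

package main;

# AND two scorers doc-at-a-time, leapfrogging with skip_to.
sub and_matches {
    my ( $left, $right ) = @_;
    my @hits;
    while ( defined $left->doc and defined $right->doc ) {
        if    ( $left->doc < $right->doc ) { $left->skip_to( $right->doc ) or last }
        elsif ( $right->doc < $left->doc ) { $right->skip_to( $left->doc ) or last }
        else {
            $left->tally;
            $right->tally;
            push @hits, $left->doc;
            $left->skip_to( $left->doc + 1 ) or last;
        }
    }
    return \@hits;
}

my $odds  = ToyScorer->new( 1, 3, 5, 7 );
my $evens = ToyScorer->new( 2, 4, 6, 8 );
my $hits  = and_matches( $odds, $evens );
```

With the odd and even lists above there are no hits and tally is never
called on either side.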
Marvin Humphrey
Rectangular Research
http://www.rectangular.com/