[KinoSearch] removing position code from scorer subclasses
Nathan Kurz
nate at verse.com
Sun Jul 15 12:41:41 PDT 2007
On 7/15/07, Marvin Humphrey <marvin at rectangular.com> wrote:
> PhraseScorer is easy.
>
> bool_t
> PosPhraseScorer_Skip_To(PosPhraseScorer *self, i32_t target)
> {
> /* if ($self->SUPER::skip_to($target)) */
> if (PhraseScorer_skip_to((PhraseScorer*)self, target)) {
> build_prox(self);
> return true;
> }
> return false;
> }
>
> http://www.refactoring.com/catalog/replaceConditionalWithPolymorphism.html
Possibly useful for discussion, here's the not-yet-compiled code I was
working on for this. Tally would call Scorer_Find_Phrases (or an
inlined version) as it needed to, checking first if it had already
been done. I offer it not because it would actually work, but just to
say that it's not quite as pathological as the refactoring example you
linked to. :)
------------------------------------------------------------------------------------------------
u32_t
ProxPhraseScorer_advance(ProxPhraseScorer *self, u32_t target_doc)
{
    Scorer *element_scorer = self->subscorer;

    while (1) {
        // advance to the next document that satisfies scorer
        target_doc = Scorer_Advance(element_scorer, target_doc);
        // stop if there are no more matching documents
        if (target_doc == DOC_NOT_FOUND) break;
        // use this doc if the positions match the offsets
        if (Scorer_Find_Phrases(self)) break;
        // otherwise try the next doc
        target_doc++;
    }
    return target_doc;
}

u32_t
ProxPhraseScorer_find_phrases(ProxPhraseScorer *self)
{
    // positions for the current doc, presumably held by the subscorer
    Positions *positions = self->subscorer->positions;
    chy_u32_t *offsets   = self->offsets;
    chy_u32_t *position;
    chy_u32_t  length    = phrase_occurs(positions, offsets, &position);

    // no phrases found here
    if (length == 0) return 0;

    chy_u32_t num_phrases = 1;
    if (self->want_positions) {
        // record the phrase we already found
        Scorer_Add_Occurrence(self, position, length);
        // find further occurrences of the phrase
        while ((length = phrase_occurs(positions, offsets, &position)) != 0) {
            Scorer_Add_Occurrence(self, position, length);
            num_phrases++;
        }
    }
    // without want_positions, this only reports that at least one phrase matched
    return num_phrases;
}
-------------------------------------------------------------------------------
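
A minimal sketch of the "checking first if it had already been done"
idea, under stated assumptions: LazyPhraseScorer, scan_positions and
lazy_find_phrases are hypothetical names, not KinoSearch API, and the
scan itself is a placeholder. The scorer remembers which doc its phrase
count belongs to, so a caller such as Tally can ask repeatedly without
repeating the position scan.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DOC_NONE UINT32_MAX  /* hypothetical sentinel: nothing cached yet */

/* Hypothetical stand-in for a scorer that finds phrases on demand. */
typedef struct {
    uint32_t current_doc;   /* doc the scorer is positioned on */
    uint32_t phrases_doc;   /* doc for which num_phrases is valid */
    uint32_t num_phrases;   /* cached result of the last scan */
    uint32_t scans;         /* how many times the expensive scan ran */
} LazyPhraseScorer;

/* Placeholder for the expensive position scan. For illustration only,
 * it pretends that even-numbered docs contain two phrases. */
static uint32_t
scan_positions(LazyPhraseScorer *self)
{
    self->scans++;
    return (self->current_doc % 2 == 0) ? 2 : 0;
}

/* Return the phrase count for the current doc, scanning at most once
 * per doc. */
static uint32_t
lazy_find_phrases(LazyPhraseScorer *self)
{
    assert(self != NULL);
    if (self->phrases_doc != self->current_doc) {
        self->num_phrases = scan_positions(self);
        self->phrases_doc = self->current_doc;
    }
    return self->num_phrases;
}
```

Initializing phrases_doc to DOC_NONE forces a scan on the first call;
after that, the scan only reruns when current_doc changes.
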
My instinct this morning (it changes often) is that I should go
through and remove the position code as you suggested, and then write
a ProximityScorer that will serve my purposes but not more. Rather
than being the difficult generalized position scorer, this would look
more like an AndNotScorer hybridized with a PhraseScorer.

The And portion would contain ShortCircuitOrScorers (which I still need
to write), which in turn would contain term or phrase scorers. Since a
ShortCircuitOrScorer would stop after the first subclause match, it
would only have a single position array to report. And since the And
clauses would be directly embedded in the ProximityScorer, I can peek
directly at their positions like PhraseScorer does.
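
To make the short-circuit idea concrete, here is a rough sketch under
stated assumptions. SubClause, ShortCircuitOr, subclause_advance and
sc_or_advance are hypothetical names invented for illustration, and each
subclause is modeled as a plain sorted array of docs rather than a real
term or phrase scorer. The point is only that the Or exposes exactly one
position array: that of the first subclause matching the doc it lands on.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DOC_NOT_FOUND UINT32_MAX

/* Hypothetical subclause: ascending doc numbers, one position array per doc. */
typedef struct {
    const uint32_t  *docs;           /* ascending doc numbers */
    const uint32_t **positions;      /* positions[i] belongs to docs[i] */
    const size_t    *num_positions;  /* length of positions[i] */
    size_t           num_docs;
    size_t           cursor;         /* index of the current doc */
} SubClause;

typedef struct {
    SubClause      *clauses;
    size_t          num_clauses;
    const uint32_t *positions;       /* the single array reported upward */
    size_t          num_positions;
} ShortCircuitOr;

/* Move a subclause to its first doc >= target. */
static uint32_t
subclause_advance(SubClause *c, uint32_t target)
{
    while (c->cursor < c->num_docs && c->docs[c->cursor] < target) {
        c->cursor++;
    }
    return c->cursor < c->num_docs ? c->docs[c->cursor] : DOC_NOT_FOUND;
}

/* Advance to the first doc >= target matched by any subclause, keeping
 * the positions of the earliest subclause that matches that doc. */
static uint32_t
sc_or_advance(ShortCircuitOr *self, uint32_t target)
{
    uint32_t best = DOC_NOT_FOUND;

    assert(self != NULL);
    self->positions     = NULL;
    self->num_positions = 0;
    for (size_t i = 0; i < self->num_clauses; i++) {
        SubClause *c   = &self->clauses[i];
        uint32_t   doc = subclause_advance(c, target);
        if (doc < best) {            /* strict '<': earlier clause wins ties */
            best                = doc;
            self->positions     = c->positions[c->cursor];
            self->num_positions = c->num_positions[c->cursor];
            if (doc == target) break;  /* short circuit: no doc can beat target */
        }
    }
    return best;
}
```

Because later subclauses are skipped once one hits the target exactly,
they may lag behind; that is harmless in this sketch since every call
re-advances each subclause it visits to the new target.
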

Rather than worrying about figuring the positions for phrases properly,
the position scorer will just bail if it sees a subphrase, on the likely
correct assumption that in the very rare case where a user is searching
for a phrase, the position bonus isn't that important. I haven't fully
thought it out, but I think it should work for now, and we can
reconsider the generalized solution at a later date.

Nathan Kurz
nate at verse.com