[KinoSearch] removing position code from scorer subclasses

Nathan Kurz nate at verse.com
Sun Jul 15 12:41:41 PDT 2007


On 7/15/07, Marvin Humphrey <marvin at rectangular.com> wrote:
> PhraseScorer is easy.
>
>  bool_t
>  PosPhraseScorer_Skip_To(PosPhraseScorer *self, i32_t target)
>  {
>    /* if ($self->SUPER::skip_to($target)) */
>    if (PhraseScorer_skip_to((PhraseScorer*)self, target)) {
>       build_prox(self);
>       return true;
>    }
>    return false;
>  }
>
>   http://www.refactoring.com/catalog/replaceConditionalWithPolymorphism.html

Possibly useful for discussion, here's the not-yet-compiled code I was
working on for this. Tally would call Scorer_Find_Phrases (or an
inlined version) as needed, checking first whether it had already
been done. I offer it not because it would actually work, but just to
show that it's not quite as pathological as the refactoring example you
linked to.  :)

------------------------------------------------------------------------------------------------

u32_t
ProxPhraseScorer_advance(ProxPhraseScorer *self, u32_t target_doc)
{
    Scorer *element_scorer = self->subscorer;

    while (1) {
        // advance to the next document that satisfies scorer
        target_doc = Scorer_Advance(element_scorer, target_doc);

        // stop if there are no more matching documents
        if (target_doc == DOC_NOT_FOUND) break;

        // use this doc if the positions match the offsets
        if (Scorer_Find_Phrases(self)) break;

        // otherwise try the next doc
        target_doc++;
    }

    return target_doc;
}


u32_t
ProxPhraseScorer_find_phrases(ProxPhraseScorer *self)
{
    Positions *positions = self->positions;
    chy_u32_t *offsets = self->offsets;

    chy_u32_t *position;
    chy_u32_t length = phrase_occurs(positions, offsets, &position);

    // no phrases found here
    if (length == 0) return 0;

    chy_u32_t num_phrases = 1;
    if (self->want_positions) {
        // record the phrase we already found
        Scorer_Add_Occurrence(self, position, length);
        // find further occurrences of the phrase
        while ((length = phrase_occurs(positions, offsets, &position)) != 0) {
            Scorer_Add_Occurrence(self, position, length);
            num_phrases++;
        }
    }

    return num_phrases;
}

-------------------------------------------------------------------------------

My instinct this morning (it changes often) is that I should go
through and remove the position code as you suggested, and then write
a ProximityScorer that serves my purposes and nothing more.  Rather
than being the difficult generalized position scorer, this would look
more like an AndNotScorer hybridized with a PhraseScorer.

The And portion would contain ShortCircuitOrScorers (which I still
need to write), which in turn would contain term or phrase scorers.
Since a ShortCircuitOrScorer would stop after the first subclause
match, it would have only a single position array to report.  And
since the And clauses would be embedded directly in the
ProximityScorer, I can peek directly at their positions like
PhraseScorer does.
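
A minimal sketch of that short-circuit idea, not KinoSearch code. The
SubScorer and ShortCircuitOr structs, sub_advance, sc_or_advance and
sc_or_positions are all hypothetical names invented for illustration,
and DOC_NOT_FOUND is assumed here to be the largest u32 value. Each
child stands in for a term or phrase scorer; when several children
match the same doc, only the first one in clause order is kept, so
there is exactly one position array to report.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DOC_NOT_FOUND UINT32_MAX   /* assumed sentinel value */

/* Hypothetical stand-in for a term or phrase scorer. */
typedef struct {
    const uint32_t *docs;              /* sorted matching doc ids      */
    const uint32_t *const *positions;  /* positions[i] goes w/ docs[i] */
    const size_t *num_positions;       /* length of each positions[i]  */
    size_t num_docs;
    size_t cursor;                     /* index of the current doc     */
} SubScorer;

typedef struct {
    SubScorer *children;
    size_t num_children;
    const SubScorer *matched;          /* child whose positions we report */
} ShortCircuitOr;

/* Move a child to its first doc >= target. */
static uint32_t
sub_advance(SubScorer *sub, uint32_t target)
{
    while (sub->cursor < sub->num_docs && sub->docs[sub->cursor] < target) {
        sub->cursor++;
    }
    return sub->cursor < sub->num_docs ? sub->docs[sub->cursor]
                                       : DOC_NOT_FOUND;
}

/* Advance to the lowest doc >= target matched by any child. */
static uint32_t
sc_or_advance(ShortCircuitOr *self, uint32_t target)
{
    uint32_t best = DOC_NOT_FOUND;
    assert(self != NULL);
    self->matched = NULL;
    for (size_t i = 0; i < self->num_children; i++) {
        uint32_t doc = sub_advance(&self->children[i], target);
        /* strict '<' keeps the earliest child when two match the same doc */
        if (doc < best) {
            best = doc;
            self->matched = &self->children[i];
        }
    }
    return best;
}

/* Report the single position array for the current doc, if any. */
static const uint32_t*
sc_or_positions(const ShortCircuitOr *self, size_t *count)
{
    const SubScorer *m = self->matched;
    if (m == NULL) {
        *count = 0;
        return NULL;
    }
    *count = m->num_positions[m->cursor];
    return m->positions[m->cursor];
}
```

An enclosing scorer would call sc_or_advance once per clause and then
read sc_or_positions, never needing to merge position arrays from
several subclauses.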

Rather than worrying about computing the positions properly for
phrases, the position scorer will just bail if it sees a subphrase,
on the likely correct assumption that in the very rare case where a
user is searching for a phrase, the position bonus isn't that
important.  I haven't fully thought it out, but I think it should work
for now, and we can reconsider the generalized solution at a later
date.
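
To make the "bail on a subphrase" rule concrete, here is a rough
sketch under my own assumptions. The Clause struct, min_span,
MAX_CLAUSES and NO_SPAN are hypothetical names, and measuring
proximity as the smallest window covering one position from every
clause is just one plausible choice, not something settled above. If
any clause is a phrase, the function gives up and reports no span, so
the caller would simply skip the position bonus for that doc.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_CLAUSES 8          /* arbitrary cap for this sketch */
#define NO_SPAN UINT32_MAX     /* "no proximity information"    */

/* Hypothetical view of one And clause for the current doc. */
typedef struct {
    const uint32_t *positions;  /* sorted ascending            */
    size_t num_positions;
    bool is_phrase;             /* true if a phrase matched    */
} Clause;

/* Smallest (max - min) over one position picked from each clause,
 * or NO_SPAN if we bail. */
static uint32_t
min_span(const Clause *clauses, size_t num_clauses)
{
    size_t idx[MAX_CLAUSES] = {0};
    uint32_t best = NO_SPAN;

    assert(num_clauses > 0 && num_clauses <= MAX_CLAUSES);

    for (size_t i = 0; i < num_clauses; i++) {
        /* bail on subphrases rather than work out their positions */
        if (clauses[i].is_phrase) return NO_SPAN;
        if (clauses[i].num_positions == 0) return NO_SPAN;
    }

    while (1) {
        size_t lo = 0;
        uint32_t min = clauses[0].positions[idx[0]];
        uint32_t max = min;
        for (size_t i = 1; i < num_clauses; i++) {
            uint32_t pos = clauses[i].positions[idx[i]];
            if (pos < min) { min = pos; lo = i; }
            if (pos > max) { max = pos; }
        }
        if (max - min < best) best = max - min;
        /* only moving the lowest position forward can shrink the window */
        if (++idx[lo] == clauses[lo].num_positions) break;
    }
    return best;
}
```

A smaller span would translate into a larger bonus; the exact mapping
from span to score is left open here.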

Nathan Kurz
nate at verse.com


