[KinoSearch] removing position code from scorer subclasses

Marvin Humphrey marvin at rectangular.com
Sun Jul 15 20:49:14 PDT 2007


On Jul 15, 2007, at 12:41 PM, Nathan Kurz wrote:

> u32_t
> ProxPhraseScorer_advance(ProxPhraseScorer *self, u32_t target_doc)
> {
>    Scorer *element_scorer = self->subscorer;
>
>    while (1) {
>        // advance to the next document that satisfies scorer
>        target_doc = Scorer_Advance(element_scorer, target_doc);
>
>        // stop if there are no more matching documents
>        if (target_doc == DOC_NOT_FOUND) break;
>
>        // use this doc if the positions match the offsets
>        if (Scorer_Find_Phrases(self)) break;
>
>        // otherwise try the next doc
>        target_doc++;
>    }
>
>    return target_doc;
> }

That part looks straightforward.  It resembles the abstract  
PhraseScorer base class in Lucene, which has two subclasses,  
ExactPhraseScorer and SloppyPhraseScorer.  The only difference  
between the two is how they implement the abstract method  
PhraseScorer.phraseFreq().
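
For illustration, here is a rough C sketch of that arrangement: one shared advance loop, with the exact/sloppy difference confined to a single function pointer.  Every name in it (DocPositions, count_matches, the struct layout, the DOC_NOT_FOUND value) is hypothetical and is neither Lucene nor KinoSearch API.  The "sloppy" matcher is a deliberate simplification that only tolerates a forward gap, which is not how Lucene computes slop.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DOC_NOT_FOUND UINT32_MAX  /* sentinel value assumed for this sketch */

/* Hypothetical per-document positions for a two-term phrase "a b". */
typedef struct {
    uint32_t        doc;
    const uint32_t *a_pos;  size_t a_len;
    const uint32_t *b_pos;  size_t b_len;
} DocPositions;

typedef struct PhraseScorer PhraseScorer;
struct PhraseScorer {
    const DocPositions *docs;      /* candidates, sorted by doc number */
    size_t              num_docs;
    size_t              tick;      /* index of next candidate to examine */
    uint32_t            slop;      /* only consulted by the sloppy variant */
    /* The "abstract method": how often does the phrase occur in this doc? */
    uint32_t (*phrase_freq)(const PhraseScorer *self, const DocPositions *dp);
};

/* Count positions of "a" followed by a "b" with at most max_gap tokens
 * in between.  Each "a" position is counted at most once. */
static uint32_t
count_matches(const DocPositions *dp, uint32_t max_gap)
{
    uint32_t freq = 0;
    for (size_t i = 0; i < dp->a_len; i++) {
        for (size_t j = 0; j < dp->b_len; j++) {
            uint32_t a = dp->a_pos[i];
            uint32_t b = dp->b_pos[j];
            if (b > a && b - a - 1 <= max_gap) { freq++; break; }
        }
    }
    return freq;
}

/* "Subclass" 1: terms must be adjacent. */
uint32_t
exact_phrase_freq(const PhraseScorer *self, const DocPositions *dp)
{
    (void)self;
    return count_matches(dp, 0);
}

/* "Subclass" 2: up to self->slop intervening tokens are tolerated. */
uint32_t
sloppy_phrase_freq(const PhraseScorer *self, const DocPositions *dp)
{
    return count_matches(dp, self->slop);
}

/* Shared loop: return the first doc >= target_doc in which the phrase
 * occurs, or DOC_NOT_FOUND once the candidates are exhausted. */
uint32_t
PhraseScorer_advance(PhraseScorer *self, uint32_t target_doc)
{
    assert(self != NULL && self->phrase_freq != NULL);
    while (self->tick < self->num_docs) {
        const DocPositions *dp = &self->docs[self->tick++];
        if (dp->doc < target_doc) continue;
        if (self->phrase_freq(self, dp) > 0) return dp->doc;
    }
    return DOC_NOT_FOUND;
}
```

Switching between exact and sloppy matching then comes down to which function is stored in phrase_freq when the scorer is set up; the advance loop itself never changes.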

>
> u32_t
> ProxPhraseScorer_find_phrases(ProxPhraseScorer *self)
> {
>    Positions *positions = scorer->positions;
>    chy_u32_t *offsets = self->offsets;
>
>    chy_u32_t *position;
>    chy_u32_t length = phrase_occurs(positions, offsets, &position);
>
>    // no phrases found here
>    if (length == 0) return 0;
>
>    chy_u32_t num_phrases = 1;
>    if (self->want_positions) {
>        // record the phrase we already found
>        Scorer_Add_Occurrence(self, position, length);
>        // find further occurrences of the phrase
>        while ((length = phrase_occurs(positions, offsets, &position))) {
>            Scorer_Add_Occurrence(self, position, length);
>            num_phrases++;
>        }
>    }
>
>    return num_phrases;
> }

This part resembles Lucene's SpanNearQuery, particularly the
Scorer_Add_Occurrence call, which looks like a Span being stored away.
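
To make the comparison concrete, here is a minimal C sketch of "storing a span away": each phrase occurrence becomes a (start, length) record appended to a growable list.  Span, SpanList, SpanList_add and find_phrases are hypothetical names invented for this example, not part of Lucene or KinoSearch, and the matcher handles only an exact two-term phrase over sorted position arrays.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* One phrase occurrence: where it starts and how many tokens it covers. */
typedef struct {
    uint32_t start;
    uint32_t length;
} Span;

/* Growable array of spans.  Zero-initialize before first use. */
typedef struct {
    Span  *spans;
    size_t size;
    size_t cap;
} SpanList;

/* Append one span.  Returns 0 on success, -1 if allocation fails. */
int
SpanList_add(SpanList *list, uint32_t start, uint32_t length)
{
    assert(list != NULL);
    if (list->size == list->cap) {
        size_t new_cap = list->cap ? list->cap * 2 : 4;
        Span *grown = realloc(list->spans, new_cap * sizeof(Span));
        if (grown == NULL) return -1;
        list->spans = grown;
        list->cap   = new_cap;
    }
    list->spans[list->size].start  = start;
    list->spans[list->size].length = length;
    list->size++;
    return 0;
}

/* Walk two ascending position arrays in step and count the places where
 * the second term sits directly after the first.  If list is non-NULL,
 * each occurrence is also recorded as a span of length 2. */
uint32_t
find_phrases(const uint32_t *first, size_t n_first,
             const uint32_t *second, size_t n_second,
             SpanList *list)
{
    uint32_t count = 0;
    size_t i = 0, j = 0;
    while (i < n_first && j < n_second) {
        uint32_t wanted = first[i] + 1;
        if (second[j] < wanted) {
            j++;                      /* second term is too early */
        } else if (second[j] > wanted) {
            i++;                      /* first term has no partner here */
        } else {
            if (list != NULL && SpanList_add(list, first[i], 2) != 0) break;
            count++;
            i++;
            j++;
        }
    }
    return count;
}
```

Passing NULL for the list gives a plain count with no allocation.  Unlike the quoted function, this sketch counts every occurrence whether or not spans are being collected.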

> My instinct this morning (it changes often) is that I should go
> through and remove the position code as you suggested, and then write
> a ProximityScorer that will serve my purposes but not more.

OK.  I think by now you have a pretty good idea of how things are put  
together down there.  Please forward your code as it progresses.

> we can reconsider the generalized solution at a later date.

My main concern is that you're currently forced to violate private
APIs in order to do what you need to do.  The challenge is to come up
with an infrastructure that will allow you to publish your
generalized solution as a separate CPAN distro.

I have some ideas as to how to pull that off.  We can air them out by  
test-porting some of the less ambitious query/scorer classes from  
Lucene, like ConstantScoringRangeQuery.

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/




