[KinoSearch] removing position code from scorer subclasses
Marvin Humphrey
marvin at rectangular.com
Sun Jul 15 20:49:14 PDT 2007
On Jul 15, 2007, at 12:41 PM, Nathan Kurz wrote:
> u32_t
> ProxPhraseScorer_advance(ProxPhraseScorer *self, u32_t target_doc)
> {
>     Scorer *element_scorer = self->subscorer;
>
>     while (1) {
>         // advance to the next document that satisfies scorer
>         target_doc = Scorer_Advance(element_scorer, target_doc);
>
>         // stop if there are no more matching documents
>         if (target_doc == DOC_NOT_FOUND) break;
>
>         // use this doc if the positions match the offsets
>         if (Scorer_Find_Phrases(self)) break;
>
>         // otherwise try the next doc
>         target_doc++;
>     }
>
>     return target_doc;
> }
That part looks straightforward. It resembles the abstract
PhraseScorer base class in Lucene, which has two subclasses,
ExactPhraseScorer and SloppyPhraseScorer. The only difference
between the two is how they implement the abstract method
PhraseScorer.phraseFreq().
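To make the comparison concrete, here is a rough sketch in C of that arrangement: one shared matching routine, with the frequency calculation supplied through a function pointer. All of the names here (PhraseMatcher, exact_phrase_freq, sloppy_phrase_freq, doc_matches) are invented for illustration and belong to neither KinoSearch nor Lucene. The sketch assumes a two-term phrase with sorted position lists, and the sloppy variant is a deliberate simplification rather than Lucene's actual algorithm.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical types for illustration; not KinoSearch or Lucene API. */
typedef struct PhraseMatcher PhraseMatcher;

/* Stand-in for the abstract phraseFreq(): counts phrase matches within one
 * document, given sorted position lists for the two terms of the phrase. */
typedef uint32_t (*phrase_freq_fn)(const PhraseMatcher *self,
                                   const uint32_t *a, size_t a_len,
                                   const uint32_t *b, size_t b_len);

struct PhraseMatcher {
    phrase_freq_fn phrase_freq;  /* the only part that differs */
    uint32_t       slop;         /* ignored by the exact variant */
};

/* Exact: the second term must sit immediately after the first. */
static uint32_t
exact_phrase_freq(const PhraseMatcher *self,
                  const uint32_t *a, size_t a_len,
                  const uint32_t *b, size_t b_len)
{
    uint32_t freq = 0;
    size_t i = 0, j = 0;
    (void)self;
    while (i < a_len && j < b_len) {
        uint32_t want = a[i] + 1;
        if (b[j] < want)      { j++; }
        else if (b[j] > want) { i++; }
        else                  { freq++; i++; j++; }
    }
    return freq;
}

/* Sloppy (simplified): count each position of the first term that is
 * followed by the second term with at most `slop` positions in between. */
static uint32_t
sloppy_phrase_freq(const PhraseMatcher *self,
                   const uint32_t *a, size_t a_len,
                   const uint32_t *b, size_t b_len)
{
    uint32_t freq = 0;
    size_t i, j = 0;
    for (i = 0; i < a_len; i++) {
        /* skip positions of the second term at or before a[i] */
        while (j < b_len && b[j] <= a[i]) j++;
        if (j < b_len && b[j] - a[i] <= self->slop + 1) freq++;
    }
    return freq;
}

/* Shared logic, identical for both variants: a doc matches if the
 * pluggable frequency function finds at least one phrase. */
static int
doc_matches(const PhraseMatcher *self,
            const uint32_t *a, size_t a_len,
            const uint32_t *b, size_t b_len)
{
    return self->phrase_freq(self, a, a_len, b, b_len) > 0;
}
```

A caller would pick a variant by filling in the struct, e.g. `PhraseMatcher m = { sloppy_phrase_freq, 1 };`, and never touch the shared routine.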
>
> u32_t
> ProxPhraseScorer_find_phrases(ProxPhraseScorer *self)
> {
>     Positions *positions = scorer->positions;
>     chy_u32_t *offsets = self->offsets;
>
>     chy_u32_t *position;
>     chy_u32_t length = phrase_occurs(positions, offsets, &position);
>
>     // no phrases found here
>     if (length == 0) return 0;
>
>     chy_u32_t num_phrases = 1;
>     if (self->want_positions) {
>         // record the phrase we already found
>         Scorer_Add_Occurrence(self, position, length);
>         // find further occurrences of the phrase
>         while (length = phrase_occurs(positions, offsets, &position)) {
>             Scorer_Add_Occurrence(self, position, length);
>             num_phrases++;
>         }
>     }
>
>     return num_phrases;
> }
This part resembles Lucene's SpanNearQuery, particularly the
Scorer_Add_Occurrence part, which looks like a Span getting stored away.
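For illustration, "storing away a Span" might look something like the following in C: a growable list of (start, length) pairs, appended to once per occurrence. Span, SpanList, span_list_add and span_list_clear are hypothetical names made up for this sketch. It is not how Scorer_Add_Occurrence is implemented, and Lucene's Spans is an enumeration exposing start() and end() rather than a stored array.

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical record of one phrase occurrence within a document. */
typedef struct {
    uint32_t start;   /* position of the first term in the phrase */
    uint32_t length;  /* number of positions the occurrence covers */
} Span;

/* Hypothetical growable list of occurrences for the current document. */
typedef struct {
    Span   *spans;
    size_t  count;
    size_t  capacity;
} SpanList;

/* Append one occurrence. Returns 0 on success, -1 if allocation fails,
 * in which case the list is left unchanged. */
static int
span_list_add(SpanList *list, uint32_t start, uint32_t length)
{
    if (list->count == list->capacity) {
        size_t new_cap = list->capacity ? list->capacity * 2 : 4;
        Span *grown = realloc(list->spans, new_cap * sizeof(Span));
        if (grown == NULL) return -1;
        list->spans    = grown;
        list->capacity = new_cap;
    }
    list->spans[list->count].start  = start;
    list->spans[list->count].length = length;
    list->count++;
    return 0;
}

/* Release the storage and reset the list so it can be reused. */
static void
span_list_clear(SpanList *list)
{
    free(list->spans);
    list->spans    = NULL;
    list->count    = 0;
    list->capacity = 0;
}
```

The list starts out zeroed (`SpanList list = { NULL, 0, 0 };`) and would be cleared each time the scorer moves on to a new document.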
> My instinct this morning (it changes often) is that I should go
> through and remove the position code as you suggested, and then write
> a ProximityScorer that will serve my purposes but not more.
OK. I think by now you have a pretty good idea of how things are put
together down there. Please forward your code as it progresses.
> we can reconsider the generalized solution at a later date.
My main concern is that you're currently forced to violate private
API in order to do what you need to do. The challenge is to come up
with an infrastructure that will allow you to publish your
generalized solution as a separate CPAN distro.
I have some ideas as to how to pull that off. We can air them out by
test-porting some of the less ambitious query/scorer classes from
Lucene, like ConstantScoreRangeQuery.
Marvin Humphrey
Rectangular Research
http://www.rectangular.com/