[KinoSearch] subclassing term scorers

Nathan Kurz nate at verse.com
Wed Jul 18 18:27:18 PDT 2007


On 7/18/07, Marvin Humphrey <marvin at rectangular.com> wrote:
> On Jul 17, 2007, at 1:50 AM, Nathan Kurz wrote:

I've rearranged my responses to emphasize our agreement :).

> Lastly, Posting and PostingList also happen to align well with IR
> theory, making for what seems to me is a more coherent conceptual OO
> model than Lucene's TermDocs/TermPositions.

Yes, I agree.  I'm not intimately familiar with Lucene's model apart
from via yours, but Posting and PostingList inherently make sense.

> Furthermore, each Posting subclass shall wholly define a posting file
> format.  This is very different from Lucene.

Yes, this is a great improvement.

> Instead of keeping track of "hasPositions", "hasBoost", "hasPayload"
> and such, we point a read function at the postings file which knows
> *exactly* how to decode it.

Yes, this is a win.   Although it would be possible to work trait by
trait, having a single reader per format is better for simplicity,
efficiency, and maintainability.

> The next step along the path as I
> see it is to refactor Posting and the classes that touch it so that
> writing a Posting subclass is as simple as defining...
>
>    * a write method
>    * a read method
>    * a make_scorer method
>    * a TermScorer subclass that overrides Scorer_Tally

Here's where we separate a little.  I'd like to make it even simpler,
and require only that it define a read method (and presumably a write
method, although I've thought very little about that side).  A new
scorer could be defined to make use of new information in a new
Posting, but this would be optional.  A subclassed Posting can
continue to use the Scorer used by its parent.  Thus if ScorePosting
is a descendant of MatchPosting, MatchPostingScorer can call
ScorePosting->read() and end up with a Posting it can handle.

This is essentially how PhraseScorer works now, and apart from wanting
some better type-checking, I like this.   If I write a
CustomScorePosting class that adds a field to the ScorePosting struct,
PhraseScorer doesn't care, and can continue to directly access the
posting struct as a ScorePosting.  This is good, because I don't want
to have to write a custom PhraseScorer for every custom Posting class
I come up with.  In this view, the purpose of the reader method is to
return a filled-in Posting struct for use by a Term or Phrase scorer.
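
A minimal C sketch of that struct relationship, under stated
assumptions: the field names (doc_num, impact, extra) and the two
scorer functions are hypothetical and are not KinoSearch's actual
definitions.  It only illustrates why a scorer written against a
parent struct keeps working when a subclass appends fields.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical layouts for illustration; not KinoSearch's real structs. */
typedef struct MatchPosting {
    uint32_t doc_num;          /* the one thing every posting supplies */
} MatchPosting;

typedef struct ScorePosting {
    MatchPosting base;         /* parent first: a ScorePosting "is a" MatchPosting */
    float        impact;       /* assumed per-document score contribution */
} ScorePosting;

typedef struct CustomScorePosting {
    ScorePosting base;         /* parent first: "is a" ScorePosting */
    uint32_t     extra;        /* custom field that only a new scorer would read */
} CustomScorePosting;

/* Written against MatchPosting only; accepts any descendant. */
uint32_t match_scorer_doc(const MatchPosting *posting) {
    assert(posting != NULL);
    return posting->doc_num;
}

/* Written against ScorePosting only; never looks at 'extra'. */
float phrase_scorer_impact(const ScorePosting *posting) {
    assert(posting != NULL);
    return posting->impact;
}
```

Given a CustomScorePosting named custom, phrase_scorer_impact(&custom.base)
and match_scorer_doc(&custom.base.base) both work without either
scorer knowing the custom class exists.  The reverse does not hold: a
plain MatchPosting has no impact field to hand to a scorer that wants
a ScorePosting.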

> Here's a rundown of how TermScorer subclasses might interact with
> their corresponding Posting subclasses:

Yes, those, like that.

> In all cases, these Posting subclasses would meet the minimum
> criteria of supplying a document number, but that would be all they
> had in common.  A PartOfSpeech term scorer wouldn't really know what
> to make of a GraphPosting.

Yes to the specific example, but I'd like to take advantage of the
inheritance hierarchy when it exists.  Thus a MatchPostingScorer would
work just fine if given a ScorePosting (since ScorePosting is a
MatchPosting), but not vice versa.  So while a PartOfSpeechScorer
wouldn't try to handle a GraphPosting (presuming PartOfSpeechPosting
is not a GraphPosting), a scorer that wants only a MatchPosting would
likely be able to handle both.

> The intent is that each Posting subclass will have a fixed
> association with a corresponding TermScorer subclass.  You're not
> supposed to be able to override that association without additional
> subclassing.

This I don't like.   I can see how you got here, but I think there is
a better solution: the TermScorers depend only on the format of the
Posting struct, and Posting->read() is the sole point of conversion
from Index as file to Posting as object.  Thus so long as the custom
Posting is a subclass of a standard Posting format, all the scorers
that worked with that parent will work with the subclass.
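
As a rough sketch of read() being the only format-aware piece, assume
a hypothetical PostingFormat that carries a read function pointer, and
a toy on-disk encoding of raw native-endian 32-bit integers.  Neither
is KinoSearch's real API or file format; the names and the freq field
are invented for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical structs; parent is the first member of the child. */
typedef struct MatchPosting { uint32_t doc_num; } MatchPosting;
typedef struct ScorePosting { MatchPosting base; uint32_t freq; } ScorePosting;

/* Decode one posting from buf into out; return the bytes consumed.
 * 'out' must point at storage for the format's own struct type. */
typedef size_t (*posting_read_fn)(const uint8_t *buf, MatchPosting *out);

typedef struct PostingFormat { posting_read_fn read; } PostingFormat;

/* Toy format A: just a document number. */
size_t match_posting_read(const uint8_t *buf, MatchPosting *out) {
    memcpy(&out->doc_num, buf, sizeof out->doc_num);
    return sizeof out->doc_num;
}

/* Toy format B: document number followed by a frequency. */
size_t score_posting_read(const uint8_t *buf, MatchPosting *out) {
    ScorePosting *sp = (ScorePosting *)out;   /* caller supplied a ScorePosting */
    size_t used = match_posting_read(buf, &sp->base);
    memcpy(&sp->freq, buf + used, sizeof sp->freq);
    return used + sizeof sp->freq;
}

/* A scorer that depends only on the MatchPosting part of the struct.
 * It walks the whole buffer and returns the last doc_num it saw. */
uint32_t match_scorer_last_doc(const PostingFormat *fmt, const uint8_t *buf,
                               size_t len, MatchPosting *scratch) {
    assert(fmt != NULL && fmt->read != NULL);
    size_t   offset = 0;
    uint32_t last   = 0;
    while (offset < len) {
        offset += fmt->read(buf + offset, scratch);
        last = scratch->doc_num;
    }
    return last;
}
```

The scorer never learns which file format it is reading.  Swapping
match_posting_read for score_posting_read changes how many bytes each
posting consumes, but the scorer's own code is untouched, provided the
caller hands it scratch storage of the matching struct type.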

On the bright side, I'm now confident enough that this will work that
I think we can talk about it later, and concentrate now on how to make
positions work.

> You know, as an alternative to deleting all this positional stuff, we
> could more or less finish it off. :)

I was planning to try to give you a patch removing it this evening,
but making it work would certainly feel more rewarding.  I'll send
you another email later tonight with further thoughts on this.

Nathan Kurz
nate at verse.com


