[KinoSearch] subclassing term scorers
Nathan Kurz
nate at verse.com
Wed Jul 18 18:27:18 PDT 2007
On 7/18/07, Marvin Humphrey <marvin at rectangular.com> wrote:
> On Jul 17, 2007, at 1:50 AM, Nathan Kurz wrote:
I've rearranged my responses to emphasize our agreement :).
> Lastly, Posting and PostingList also happen to align well with IR
> theory, making for what seems to me is a more coherent conceptual OO
> model than Lucene's TermDocs/TermPositions.
Yes, I agree. I'm not intimately familiar with Lucene's model apart
from via yours, but Posting and PostingList inherently make sense.
> Furthermore, each Posting subclass shall wholly define a posting file
> format. This is very different from Lucene.
Yes, this is a great improvement.
> Instead of keeping track of "hasPositions", "hasBoost", "hasPayload"
> and such, we point a read function at the postings file which knows
> *exactly* how to decode it.
Yes, this is a win. Although it would be possible to work trait by
trait, having a single reader per format is better for simplicity,
efficiency, and maintainability.
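To make the "single reader per format" point concrete, here is a minimal sketch in C. Everything in it is an assumption made for illustration: the DecodedPosting struct, the read_fn type, and the two one-byte-per-field formats are hypothetical and are not KinoSearch's actual file formats or API.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical decoded form shared by both illustrative formats. */
typedef struct {
    uint32_t doc_num;
    uint32_t freq;
    uint8_t  boost;
} DecodedPosting;

/* A reader decodes one posting from buf and returns the bytes it consumed. */
typedef size_t (*read_fn)(const uint8_t *buf, DecodedPosting *out);

/* Hypothetical "match only" format: a single byte holding the doc number. */
size_t read_match_only(const uint8_t *buf, DecodedPosting *out) {
    out->doc_num = buf[0];
    out->freq    = 1;
    out->boost   = 0;
    return 1;
}

/* Hypothetical "scored" format: doc number, freq, boost, one byte each. */
size_t read_scored(const uint8_t *buf, DecodedPosting *out) {
    out->doc_num = buf[0];
    out->freq    = buf[1];
    out->boost   = buf[2];
    return 3;
}

/* Decode `count` postings with whichever reader the format supplies.
 * No hasPositions/hasBoost style flags are consulted in the loop.
 * Returns the total number of bytes consumed. */
size_t decode_all(read_fn reader, const uint8_t *buf, size_t count,
                  DecodedPosting *out) {
    size_t used = 0;
    for (size_t i = 0; i < count; i++) {
        used += reader(buf + used, &out[i]);
    }
    return used;
}
```

The decode loop never branches on traits; choosing the format means choosing the function pointer once.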
> The next step along the path as I
> see it is to refactor Posting and the classes that touch it so that
> writing a Posting subclass is as simple as defining...
>
> * a write method
> * a read method
> * a make_scorer method
> * a TermScorer subclass that overrides Scorer_Tally
Here's where we separate a little. I'd like to make it even simpler,
and require only that it define a read method (and presumably a write
method, although I've thought very little about that side). A new
scorer could be defined to make use of new information in the new Posting,
but this would be optional. A subclassed Posting can continue to use
the Scorer used by its parent. Thus if ScorePosting is a
descendant of MatchPosting, MatchPostingScorer can call
ScorePosting->read() and end up with a Posting it can handle.
This is essentially how PhraseScorer works now, and apart from wanting
some better type-checking, I like this. If I write a
CustomScorePosting class that adds a field to the ScorePosting struct,
PhraseScorer doesn't care, and can continue to directly access the posting
struct as a ScorePosting. This is good, because I don't want to have
to write a custom PhraseScorer for every custom Posting class I come
up with. In this view, the purpose of the reader method is to return
a filled-in Posting struct for use by a Term or Phrase scorer.
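A rough sketch of the struct layout this relies on, in C. The field names (doc_num, freq, boost, custom_weight) and the score_posting_weight function are hypothetical stand-ins, not the real KinoSearch definitions; the only point is that a struct which embeds its parent as its first member can be handed to code written for the parent.

```c
#include <stdint.h>

/* Hypothetical layouts: each subclass embeds its parent as the first member. */
typedef struct {
    uint32_t doc_num;
} MatchPosting;

typedef struct {
    MatchPosting base;
    uint32_t     freq;
    float        boost;
} ScorePosting;

typedef struct {
    ScorePosting base;
    float        custom_weight;   /* the extra field added by the subclass */
} CustomScorePosting;

/* Stand-in for a scorer (such as a phrase scorer) that was written
 * against ScorePosting and knows nothing about custom_weight. */
float score_posting_weight(const ScorePosting *posting) {
    return (float)posting->freq * posting->boost;
}
```

Given a CustomScorePosting named custom, both score_posting_weight(&custom.base) and score_posting_weight((const ScorePosting *)&custom) read the same fields, because C guarantees that a pointer to a struct, suitably converted, points to its first member.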
> Here's a rundown of how TermScorer subclasses might interact with
> their corresponding Posting subclasses:
Yes, those, like that.
> In all cases, these Posting subclasses would meet the minimum
> criteria of supplying a document number, but that would be all they
> had in common. A PartOfSpeech term scorer wouldn't really know what
> to make of a GraphPosting.
Yes to the specific example, but I'd like to take advantage of the
inheritance hierarchy when it exists. Thus a MatchPostingScorer would
work just fine if given a ScorePosting (since ScorePosting is a
MatchPosting), but not vice versa. So while a PartOfSpeechScorer
wouldn't try to handle a GraphPosting (presuming PartOfSpeechPosting
is not a GraphPosting), a scorer that wants only a MatchPosting would
likely be able to handle both.
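One way to get the type-checking mentioned earlier is a small class descriptor with a parent pointer, sketched below in C. The PostingClass struct, the posting_is_a function, and the particular hierarchy are assumptions for illustration only; in particular, making GraphPosting and PartOfSpeechPosting direct children of MatchPosting is a guess, not something the design has settled.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical class descriptor: a name plus a link to the parent class. */
typedef struct PostingClass {
    const char                *name;
    const struct PostingClass *parent;
} PostingClass;

/* Assumed hierarchy, for illustration only. */
const PostingClass MATCH_POSTING          = { "MatchPosting", NULL };
const PostingClass SCORE_POSTING          = { "ScorePosting", &MATCH_POSTING };
const PostingClass GRAPH_POSTING          = { "GraphPosting", &MATCH_POSTING };
const PostingClass PART_OF_SPEECH_POSTING = { "PartOfSpeechPosting",
                                              &MATCH_POSTING };

/* True if klass is `wanted` or descends from it; a scorer would call
 * this once, when it is handed a posting list, not once per posting. */
bool posting_is_a(const PostingClass *klass, const PostingClass *wanted) {
    for (; klass != NULL; klass = klass->parent) {
        if (klass == wanted) {
            return true;
        }
    }
    return false;
}
```

Under these assumptions a scorer that wants a MatchPosting accepts all four classes, a scorer that wants a ScorePosting rejects a plain MatchPosting, and a PartOfSpeech scorer rejects a GraphPosting.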
> The intent is that each Posting subclass will have a fixed
> association with a corresponding TermScorer subclass. You're not
> supposed to be able to override that association without additional
> subclassing.
This I don't like. I can see how you got here, but I think there is
a better solution: the TermScorers depend only on the format of the
Posting struct, and Posting->read() is the sole point of conversion
from Index as file to Posting as object. Thus so long as the custom
Posting is a subclass of a standard Posting format, all the scorers
that worked with that parent will work with the subclass.
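Here is a sketch, in C, of read() acting as the sole point of conversion. The byte layouts, the tag field, and the sum_freqs function are hypothetical; sum_freqs only stands in for a real scorer. The subclass reader decodes its own file format but fills in the parent's fields, so code written for the parent never sees the difference.

```c
#include <stddef.h>
#include <stdint.h>

typedef struct ScorePosting ScorePosting;
struct ScorePosting {
    /* Per-class decoder (hypothetical); returns the bytes consumed. */
    size_t (*read)(ScorePosting *self, const uint8_t *buf);
    uint32_t doc_num;
    uint32_t freq;
};

typedef struct {
    ScorePosting base;   /* first member, so &custom.base is a ScorePosting* */
    uint8_t      tag;    /* extra data stored only by the custom format */
} CustomScorePosting;

/* Parent format (assumed): doc number and freq, one byte each. */
size_t score_posting_read(ScorePosting *self, const uint8_t *buf) {
    self->doc_num = buf[0];
    self->freq    = buf[1];
    return 2;
}

/* Custom format (assumed): the parent's two bytes plus one tag byte.
 * Only call this on a ScorePosting embedded in a CustomScorePosting. */
size_t custom_score_posting_read(ScorePosting *self, const uint8_t *buf) {
    CustomScorePosting *custom = (CustomScorePosting *)self;
    size_t used = score_posting_read(self, buf);
    custom->tag = buf[used];
    return used + 1;
}

/* Scorer stand-in: knows only ScorePosting, yet works with either format
 * because all decoding goes through posting->read(). */
uint32_t sum_freqs(ScorePosting *posting, const uint8_t *buf, size_t count) {
    uint32_t total = 0;
    for (size_t i = 0; i < count; i++) {
        buf   += posting->read(posting, buf);
        total += posting->freq;
    }
    return total;
}
```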
On the bright side, I'm now confident enough that this will work that
I think we can talk about it later, and concentrate now on how to make
positions work.
> You know, as an alternative to deleting all this positional stuff, we
> could more or less finish it off. :)
I was planning to try to give you a patch removing it this evening,
but making it work would certainly feel more rewarding. I'll send
you another email later tonight with further thoughts on this.
Nathan Kurz
nate at verse.com