[KinoSearch] subclassing term scorers
Marvin Humphrey
marvin at rectangular.com
Wed Jul 18 19:35:25 PDT 2007
On Jul 18, 2007, at 6:27 PM, Nathan Kurz wrote:
> I've rearranged my responses to emphasize our agreement :).
Well done! ;)
>> Lastly, Posting and PostingList also happen to align well with IR
>> theory, making for what seems to me a more coherent conceptual OO
>> model than Lucene's TermDocs/TermPositions.
>
> Yes, I agree. I'm not intimately familiar with Lucene's model apart
> from via yours, but Posting and PostingList inherently make sense.
In the TermDocs/TermPositions model, the traits are added to the
iterator itself.
    // TermDocs: doc number and frequency are read off the iterator itself.
    while (termDocs.next()) {
        System.out.println("DOC: " + termDocs.doc());
        System.out.println("FREQ: " + termDocs.freq());
    }

    // TermPositions: the same iterator also hands out positions and payloads.
    while (termPositions.next()) {
        System.out.println("DOC: " + termPositions.doc());
        int freq = termPositions.freq();
        System.out.println("FREQ: " + freq);
        while (freq-- > 0) {
            int position = termPositions.nextPosition();
            System.out.println("POS: " + position);
            if (termPositions.isPayloadAvailable()) {
                byte[] payload = termPositions.getPayload(null, 0);
                printPayloadSomeHow(payload);  // placeholder, not a Lucene method
            }
        }
    }
There isn't an object which represents a posting.
Another significant difference is that Lucene iterates over positions
one at a time via nextPosition(), while KS loads them all into memory
at once.
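For contrast, here is a rough sketch of the Posting/PostingList shape, written in Java only so that it sits next to the Lucene snippet above. The class and method names (Posting, PostingList, describe) are made up for illustration and are not KS's actual C API. The point is just that each hit is an object in its own right, with all of its positions already in memory:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Hypothetical stand-in for a Posting: one object per (term, document) hit.
class Posting {
    final int doc;
    final int[] positions;  // all positions are loaded at once

    Posting(int doc, int[] positions) {
        this.doc = doc;
        this.positions = positions;
    }

    int freq() {
        return positions.length;
    }
}

// Hypothetical PostingList: iterating yields Posting objects,
// rather than exposing doc()/freq() on the iterator itself.
class PostingList implements Iterable<Posting> {
    private final List<Posting> postings = new ArrayList<>();

    void add(Posting posting) {
        postings.add(posting);
    }

    @Override
    public Iterator<Posting> iterator() {
        return postings.iterator();
    }
}

class PostingListDemo {
    // Formats one line per posting, mirroring the DOC/FREQ/POS output above.
    static String describe(PostingList list) {
        StringBuilder sb = new StringBuilder();
        for (Posting p : list) {
            sb.append("DOC: ").append(p.doc)
              .append(" FREQ: ").append(p.freq())
              .append(" POS: ").append(Arrays.toString(p.positions))
              .append('\n');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        PostingList list = new PostingList();
        list.add(new Posting(3, new int[] {1, 7}));
        list.add(new Posting(9, new int[] {4}));
        System.out.print(describe(list));
    }
}
```

The trade-off is the one noted above: the caller gets a self-contained object to pass around, at the cost of holding every position for that document in memory at once.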
>> * a write method
>> * a read method
>> * a make_scorer method
>> * a TermScorer subclass that overrides Scorer_Tally
>
> Here's where we separate a little. I'd like to make it even simpler,
> and require only that it define a read method (and presumably a write
> method, although I've thought very little about that side).
Yes, you could do that. Presumably, the subclass would interpret the
same postings file data differently somehow from the parent class.
> A new scorer could be defined to make use of new information in new
> Posting, but this would be optional.
You're right. In general that would work, provided that the subclass
was serious about fulfilling the parent class's interface.
> A subclassed Posting can continue to use
> the Scorer used by its parent. Thus if ScorePosting is a
> descendant of MatchPosting, MatchPostingScorer can call
> ScorePosting->read() and end up with a Posting it can handle.
I can't think of a reason why this wouldn't work. Boilerplater
implements single inheritance only, a very limited OO model. There's
a little trickiness in there -- RichPosting's file format doesn't
"inherit" from ScorePosting's, for instance...
    <doc, freq, shared_boost, <position>+>+
    <doc, freq, <position, boost>+>+
... and the generated posting->impact would presumably differ (that's
the whole point of RichPosting after all). But the C structs would
be compatible.
>> The intent is that each Posting subclass will have a fixed
>> association with a corresponding TermScorer subclass. You're not
>> supposed to be able to override that association without additional
>> subclassing.
>
> This I don't like. I can see how you got here, but I think there is
> a better solution: the TermScorers depend only on the format of the
> Posting struct, and Posting->read() is the sole point of conversion
> from Index as file to Posting as object.
Well put. You've persuaded me.
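To make sure we mean the same thing, here is a minimal sketch of that arrangement. It is Java rather than Boilerplater C, and everything in it is an assumption made for illustration: the field names, the byte layouts, and the matchingDocs helper are invented, and the layouts are not the real postings file formats. The scorer is written against MatchPosting alone, and read() is the only code that knows what the bytes look like:

```java
import java.nio.ByteBuffer;
import java.util.Arrays;

// Hypothetical base class: only knows which document matched.
class MatchPosting {
    int doc;

    // Sole point of conversion from file bytes to in-memory fields.
    // Assumed layout (illustrative only): <doc>
    void read(ByteBuffer in) {
        doc = in.getInt();
    }
}

// Hypothetical subclass: a different on-disk layout, but it still
// fills in every field the parent promises.
class ScorePosting extends MatchPosting {
    int freq;
    float boost;

    // Assumed layout (illustrative only): <doc, freq, boost>
    @Override
    void read(ByteBuffer in) {
        doc = in.getInt();
        freq = in.getInt();
        boost = in.getFloat();
    }
}

// Scorer that depends only on MatchPosting's fields, never on the file format.
class MatchPostingScorer {
    // Decodes `count` records through posting.read() and collects the doc numbers.
    static int[] matchingDocs(MatchPosting posting, ByteBuffer in, int count) {
        int[] docs = new int[count];
        for (int i = 0; i < count; i++) {
            posting.read(in);  // dispatches to the subclass's read()
            docs[i] = posting.doc;
        }
        return docs;
    }
}

class PostingInheritanceDemo {
    // Encodes two records in the assumed ScorePosting layout.
    static ByteBuffer sampleScoreRecords() {
        ByteBuffer buf = ByteBuffer.allocate(2 * 12);
        buf.putInt(5).putInt(2).putFloat(1.5f);
        buf.putInt(8).putInt(1).putFloat(0.5f);
        buf.flip();
        return buf;
    }

    public static void main(String[] args) {
        int[] docs = MatchPostingScorer.matchingDocs(
                new ScorePosting(), sampleScoreRecords(), 2);
        System.out.println(Arrays.toString(docs));
    }
}
```

MatchPostingScorer never sees freq or boost, yet it walks ScorePosting data correctly because the subclass's read() consumes its own record layout. That only holds as long as the subclass keeps populating the parent's fields, which is the "serious about fulfilling the parent class's interface" condition from earlier.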
Marvin Humphrey
Rectangular Research
http://www.rectangular.com/