[KinoSearch] subclassing term scorers

Marvin Humphrey marvin at rectangular.com
Wed Jul 18 19:35:25 PDT 2007


On Jul 18, 2007, at 6:27 PM, Nathan Kurz wrote:

> I've rearranged my responses to emphasize our agreement :).

Well done! ;)

>> Lastly, Posting and PostingList also happen to align well with IR
>> theory, making for what seems to me a more coherent conceptual OO
>> model than Lucene's TermDocs/TermPositions.
>
> Yes, I agree.  I'm not intimately familiar with Lucene's model apart
> from via yours, but Posting and PostingList inherently make sense.

In the TermDocs/TermPositions model, the traits are added to the  
iterator itself.

    // TermDocs: doc number and frequency are read off the iterator.
    while (termDocs.next()) {
      System.out.println("DOC: "  + termDocs.doc());
      System.out.println("FREQ: " + termDocs.freq());
    }

    // TermPositions: same, plus positions pulled one at a time.
    while (termPositions.next()) {
      System.out.println("DOC: "  + termPositions.doc());
      int freq = termPositions.freq();
      System.out.println("FREQ: " + freq);
      while (freq-- > 0) {
        int position = termPositions.nextPosition();
        System.out.println("POS: " + position);
        if (termPositions.isPayloadAvailable()) {
          byte[] payload = termPositions.getPayload(null, 0);
          printPayloadSomeHow(payload);  // placeholder helper
        }
      }
    }

There isn't an object which represents a posting.

Another significant difference is that Lucene iterates over positions  
one at a time via nextPosition(), while KS loads them all into memory  
at once.
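
For contrast, here is a rough sketch of the Posting/PostingList shape in Java, chosen to match the Lucene snippet above. The class and method names (Posting, PostingList, PostingDump.describe) are made up for illustration and are not the KinoSearch API. The only point is that each posting is an object in its own right, and that its positions arrive all at once rather than through repeated nextPosition() calls.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Hypothetical names throughout; this is not the KinoSearch or Lucene API.
final class Posting {
    final int doc;
    final int[] positions;   // every position for this doc, held in memory

    Posting(int doc, int[] positions) {
        this.doc = doc;
        this.positions = positions;
    }

    // Frequency is just the number of positions recorded for the doc.
    int freq() {
        return positions.length;
    }
}

// The iterator hands out Posting objects instead of carrying the traits itself.
final class PostingList implements Iterator<Posting> {
    private final Iterator<Posting> source;

    PostingList(List<Posting> postings) {
        this.source = postings.iterator();
    }

    @Override
    public boolean hasNext() {
        return source.hasNext();
    }

    @Override
    public Posting next() {
        return source.next();
    }
}

final class PostingDump {
    // Builds one line per posting: doc, freq, and the full position array.
    static String describe(PostingList list) {
        StringBuilder out = new StringBuilder();
        while (list.hasNext()) {
            Posting p = list.next();
            out.append("DOC: ").append(p.doc)
               .append(" FREQ: ").append(p.freq())
               .append(" POS: ").append(Arrays.toString(p.positions))
               .append('\n');
        }
        return out.toString();
    }
}
```

Holding the positions in an array trades memory for a simpler consumer: the caller never has to count down freq by hand.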

>>    * a write method
>>    * a read method
>>    * a make_scorer method
>>    * a TermScorer subclass that overrides Scorer_Tally
>
> Here's where we separate a little.  I'd like to make it even simpler,
> and require only that it define a read method (and presumably a write
> method, although I've thought very little about that side).

Yes, you could do that.  Presumably, the subclass would interpret the  
same postings file data differently somehow from the parent class.

> A new scorer could be defined to make use of new information in new
> Posting, but this would be optional.

You're right.  In general that would work, provided that the subclass  
was serious about fulfilling the parent class's interface.

> A subclassed Posting can continue to use
> the Scorer used by its parent.  Thus if ScorePosting is a
> descendant of MatchPosting, MatchPostingScorer can call
> ScorePosting->read() and end up with a Posting it can handle.

I can't think of a reason why this wouldn't work.  Boilerplater  
implements single inheritance only, a very limited OO model.  There's  
a little trickiness in there -- RichPosting's file format doesn't  
"inherit" from ScorePosting's, for instance...

   <doc, freq, shared_boost, <position>+>+
   <doc, freq, <position, boost>+>+

... and the generated posting->impact would presumably differ (that's  
the whole point of RichPosting after all).  But the C structs would  
be compatible.
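
A minimal sketch of that situation, in Java for consistency with the Lucene snippet earlier in this message. It assumes fixed-width ints and floats in a ByteBuffer, which is a simplification and not the real on-disk encoding, and the impact arithmetic is a placeholder rather than KinoSearch's formula. LayoutSketch and the read(ByteBuffer) signature are hypothetical. What it shows is two unrelated file layouts decoded into fields that the parent class already declares.

```java
import java.nio.ByteBuffer;

// Hypothetical sketch: encoding and impact math are assumptions.
final class LayoutSketch {
    static class ScorePosting {
        int doc;
        int freq;
        float impact;
        int[] positions;

        // Layout: doc, freq, shared_boost, then freq positions.
        void read(ByteBuffer in) {
            doc = in.getInt();
            freq = in.getInt();
            float sharedBoost = in.getFloat();
            positions = new int[freq];
            for (int i = 0; i < freq; i++) {
                positions[i] = in.getInt();
            }
            impact = sharedBoost * freq;   // placeholder weighting
        }
    }

    static class RichPosting extends ScorePosting {
        float[] boosts;   // extra per-position data the parent lacks

        // Layout: doc, freq, then freq (position, boost) pairs.
        @Override
        void read(ByteBuffer in) {
            doc = in.getInt();
            freq = in.getInt();
            positions = new int[freq];
            boosts = new float[freq];
            impact = 0f;
            for (int i = 0; i < freq; i++) {
                positions[i] = in.getInt();
                boosts[i] = in.getFloat();
                impact += boosts[i];       // placeholder weighting
            }
        }
    }
}
```

Code written against ScorePosting's fields can consume a RichPosting unchanged, even though the bytes behind the two are laid out differently and the resulting impact differs.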

>> The intent is that each Posting subclass will have a fixed
>> association with a corresponding TermScorer subclass.  You're not
>> supposed to be able to override that association without additional
>> subclassing.
>
> This I don't like.   I can see how you got here, but I think there is
> a better solution: the TermScorers depend only on the format of the
> Posting struct, and Posting->read() is the sole point of conversion
> from Index as file to Posting as object.

Well put.  You've persuaded me.
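
Roughly, that arrangement could look like the sketch below. It is in Java to match the Lucene snippet earlier in this message; the class names follow the ones used in this thread, but the signatures, the ByteBuffer input, and the fixed-width encoding are all assumptions made for illustration. The scorer touches only the fields MatchPosting declares and leaves all file decoding to read().

```java
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: signatures and encoding are assumed, not KinoSearch's.
final class ScorerSketch {
    static class MatchPosting {
        int doc;

        // Sole point of conversion from index bytes to in-memory fields.
        void read(ByteBuffer in) {
            doc = in.getInt();
        }
    }

    static class ScorePosting extends MatchPosting {
        int freq;

        // Different file layout, same inherited field filled in.
        @Override
        void read(ByteBuffer in) {
            doc = in.getInt();
            freq = in.getInt();
        }
    }

    // Depends only on MatchPosting's fields, never on the file layout.
    static class MatchPostingScorer {
        private final MatchPosting posting;
        private final ByteBuffer in;

        MatchPostingScorer(MatchPosting posting, ByteBuffer in) {
            this.posting = posting;
            this.in = in;
        }

        List<Integer> matchingDocs() {
            List<Integer> docs = new ArrayList<>();
            while (in.hasRemaining()) {
                posting.read(in);        // dispatches to the subclass's read()
                docs.add(posting.doc);
            }
            return docs;
        }
    }
}
```

Handing the scorer a ScorePosting instead of a MatchPosting changes how the bytes are consumed but not a line of the scorer itself.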

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/




