[KinoSearch] subclassing term scorers

Marvin Humphrey marvin at rectangular.com
Wed Jul 18 08:49:46 PDT 2007


On Jul 17, 2007, at 1:50 AM, Nathan Kurz wrote:

> The problem is that while it is easy to subclass a term scorer, it's
> difficult to get that subclassed scorer to be used.

The intent is that each Posting subclass will have a fixed  
association with a corresponding TermScorer subclass.  You're not  
supposed to be able to override that association without additional  
subclassing.

Furthermore, each Posting subclass shall wholly define a posting file  
format.  This is very different from Lucene.  In Lucene, the code to  
read postings data is spread out over many classes, resulting in a  
tight binding between code base and file format which impedes  
extensibility.  The postings data itself is spread out over 2-3 index  
files per field -- .frq and .prx, and optionally .fNUM -- whose  
formats have slowly become mangled by the persistent nibbling of  
feeping creatures.  Take a look at <http://lucene.apache.org/java/ 
docs/fileformats.html#Frequencies> and try to imagine writing actual  
code to support that spec. :)

The introduction of the Posting class in KS and the ironclad  
association of a posting_type with a particular field is an attempt  
to disentangle that situation by applying the refactoring technique I  
referenced earlier, "replace conditionals with polymorphism".   
Instead of keeping track of "hasPositions", "hasBoost", "hasPayload"  
and such, we point a read function at the postings file which knows  
*exactly* how to decode it.

Posting currently works great for internal use, but it's not quite  
ready for public subclassing yet.  The next step along the path as I  
see it is to refactor Posting and the classes that touch it so that  
writing a Posting subclass is as simple as defining...

   * a write method
   * a read method
   * a make_scorer method
   * a TermScorer subclass that overrides Scorer_Tally

In either C or Perl. :)

Another advantage of the unified posting file design is superior  
locality of reference compared with Lucene's spread-out posting  
files.  This is an engineering tradeoff, as when matching only doc  
number on a ScorePosting field, Lucene has to plow through less  
data... but when positions are needed, KS does fewer hard disk  
seeks.  (: It would seem natural for us to exploit positions data  
fully, since we're reading it regardless -- which is one reason I'm  
enthusiastic about your positional work. :)

Lastly, Posting and PostingList also happen to align well with IR  
theory, making for what seems to me is a more coherent conceptual OO  
model than Lucene's TermDocs/TermPositions.  (A "posting" is one  
"term" indexing one "document", and each "term" in an index has a  
"posting list" of documents which it indexes - see <http:// 
www.rectangular.com/kinosearch/docs/devel/KinoSearch/Docs/ 
IRTheory.html>.)

Here's a rundown of how TermScorer subclasses might interact with  
their corresponding Posting subclasses:

   * If a field uses MatchPosting, then iterating over a posting
     list is really just iterating over a set of document numbers
     (transported via MatchPosting objects).
   * ScorePosting adds position, freq, and boost information.
   * RichPosting is like ScorePosting with enhanced boost info.
   * A "PartOfSpeechPosting" might consist of a document number
     and an integer representing "verb", "noun", etc -- allowing
     someone to perform linguistic analysis of a document
     collection.
  * An MMapMatchPosting might encode document numbers using
    native 32-bit integers rather than delta vints.
  * A proprietary GraphPosting might include a node_id and a
    node_type, facilitating the kind of search engine described in
    that FaceBook blog post -- weighting the document score
    according to data looked up in an external structure.

In all cases, these Posting subclasses would meet the minimum  
criteria of supplying a document number, but that would be all they  
had in common.  A PartOfSpeech term scorer wouldn't really know what  
to make of a GraphPosting.

> I could bypass PostingList->make_scorer, as PhraseScorer does, and
> work directly on the Posting.  This feels presumptive, and reliant on
> Posting internals.  I'm confused by your comfort in having
> PhraseScorer do so.

The only way to get a PhraseScorer is via the factory method  
PhraseWeight->make_scorer, which will return undef unless it detects  
that the posting subclass isa ScorePosting, i.e. it makes positional  
data available.  We're safe.

> Either it would
> need an ISA() check and a cast, or some way of interrogating the
> Scorer as to whether certain features are available.)

I don't believe that will be necessary.  We just need to make sure  
that whatever technique we use for inquiring about the positional  
data is safe.  Tally has a num_sproxen member indicating how many  
ScoreProx objects it has.  That number would always be 0 for a Tally  
supplied by a MatchPostingScorer.

You know, as an alternative to deleting all this positional stuff, we  
could more or less finish it off. :)  You and I both need positional  
data.  TermScorer provides it.  PhraseScorer provides it modulo a  
couple glaring, easily-squashed bugs.  We just need the BooleanScorer  
components to provide it, too.

The hardest part of what I want to do is figuring out exactly how  
positional data should contribute to the score.  It's going to be  
tricky enough that I think it will require some trial and error.   
However, we can  defer that task.  The positional data itself  
probably won't change, so we can get the positional data engine  
working and just not connect it up.

One thing I'm concerned about is RAM footprint.  Let's try and come  
up with a worst case scenario.

    'the OR (and OR (it OR is)))'

KS isn't optimized for book-sized documents, but let's say that's  
what's in an index, and the query above hits "War and Peace".  Each  
TermScorer will have a Posting object that grows to some large number  
of KiB or even MiB to accommodate so many positions.  That's OK, but  
is there any danger of geometric growth as our position-resolving  
algorithm recurses?

> I'd like to have a path that would
> continue to work even as I move to other posting formats, particularly
> the mmap() approach I mentioned.

That's not gonna fly.  You'll need dedicated Posting subclasses for  
mmap.  MMapMatchPosting, MMapScorePosting, etc.

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/





More information about the kinosearch mailing list