[KinoSearch] subclassing term scorers
Marvin Humphrey
marvin at rectangular.com
Wed Jul 18 08:49:46 PDT 2007
On Jul 17, 2007, at 1:50 AM, Nathan Kurz wrote:
> The problem is that while it is easy to subclass a term scorer, it's
> difficult to get that subclassed scorer to be used.
The intent is that each Posting subclass will have a fixed
association with a corresponding TermScorer subclass. You're not
supposed to be able to override that association without additional
subclassing.
Furthermore, each Posting subclass shall wholly define a posting file
format. This is very different from Lucene. In Lucene, the code to
read postings data is spread out over many classes, resulting in a
tight binding between code base and file format which impedes
extensibility. The postings data itself is spread out over 2-3 index
files per field -- .frq and .prx, and optionally .fNUM -- whose
formats have slowly become mangled by the persistent nibbling of
feeping creatures. Take a look at <http://lucene.apache.org/java/
docs/fileformats.html#Frequencies> and try to imagine writing actual
code to support that spec. :)
The introduction of the Posting class in KS and the ironclad
association of a posting_type with a particular field is an attempt
to disentangle that situation by applying the refactoring technique I
referenced earlier, "replace conditionals with polymorphism".
Instead of keeping track of "hasPositions", "hasBoost", "hasPayload"
and such, we point a read function at the postings file which knows
*exactly* how to decode it.
Posting currently works great for internal use, but it's not quite
ready for public subclassing yet. The next step along the path as I
see it is to refactor Posting and the classes that touch it so that
writing a Posting subclass is as simple as defining...
* a write method
* a read method
* a make_scorer method
* a TermScorer subclass that overrides Scorer_Tally
In either C or Perl. :)
Another advantage of the unified posting file design is superior
locality of reference compared with Lucene's spread-out posting
files. This is an engineering tradeoff, as when matching only doc
number on a ScorePosting field, Lucene has to plow through less
data... but when positions are needed, KS does fewer hard disk
seeks. (: It would seem natural for us to exploit positions data
fully, since we're reading it regardless -- which is one reason I'm
enthusiastic about your positional work. :)
Lastly, Posting and PostingList also happen to align well with IR
theory, making for what seems to me is a more coherent conceptual OO
model than Lucene's TermDocs/TermPositions. (A "posting" is one
"term" indexing one "document", and each "term" in an index has a
"posting list" of documents which it indexes - see <http://
www.rectangular.com/kinosearch/docs/devel/KinoSearch/Docs/
IRTheory.html>.)
Here's a rundown of how TermScorer subclasses might interact with
their corresponding Posting subclasses:
* If a field uses MatchPosting, then iterating over a posting
list is really just iterating over a set of document numbers
(transported via MatchPosting objects).
* ScorePosting adds position, freq, and boost information.
* RichPosting is like ScorePosting with enhanced boost info.
* A "PartOfSpeechPosting" might consist of a document number
and an integer representing "verb", "noun", etc -- allowing
someone to perform linguistic analysis of a document
collection.
* An MMapMatchPosting might encode document numbers using
native 32-bit integers rather than delta vints.
* A proprietary GraphPosting might include a node_id and a
node_type, facilitating the kind of search engine described in
that FaceBook blog post -- weighting the document score
according to data looked up in an external structure.
In all cases, these Posting subclasses would meet the minimum
criteria of supplying a document number, but that would be all they
had in common. A PartOfSpeech term scorer wouldn't really know what
to make of a GraphPosting.
> I could bypass PostingList->make_scorer, as PhraseScorer does, and
> work directly on the Posting. This feels presumptive, and reliant on
> Posting internals. I'm confused by your comfort in having
> PhraseScorer do so.
The only way to get a PhraseScorer is via the factory method
PhraseWeight->make_scorer, which will return undef unless it detects
that the posting subclass isa ScorePosting, i.e. it makes positional
data available. We're safe.
> Either it would
> need an ISA() check and a cast, or some way of interrogating the
> Scorer as to whether certain features are available.)
I don't believe that will be necessary. We just need to make sure
that whatever technique we use for inquiring about the positional
data is safe. Tally has a num_sproxen member indicating how many
ScoreProx objects it has. That number would always be 0 for a Tally
supplied by a MatchPostingScorer.
You know, as an alternative to deleting all this positional stuff, we
could more or less finish it off. :) You and I both need positional
data. TermScorer provides it. PhraseScorer provides it modulo a
couple glaring, easily-squashed bugs. We just need the BooleanScorer
components to provide it, too.
The hardest part of what I want to do is figuring out exactly how
positional data should contribute to the score. It's going to be
tricky enough that I think it will require some trial and error.
However, we can defer that task. The positional data itself
probably won't change, so we can get the positional data engine
working and just not connect it up.
One thing I'm concerned about is RAM footprint. Let's try and come
up with a worst case scenario.
'the OR (and OR (it OR is)))'
KS isn't optimized for book-sized documents, but let's say that's
what's in an index, and the query above hits "War and Peace". Each
TermScorer will have a Posting object that grows to some large number
of KiB or even MiB to accommodate so many positions. That's OK, but
is there any danger of geometric growth as our position-resolving
algorithm recurses?
> I'd like to have a path that would
> continue to work even as I move to other posting formats, particularly
> the mmap() approach I mentioned.
That's not gonna fly. You'll need dedicated Posting subclasses for
mmap. MMapMatchPosting, MMapScorePosting, etc.
Marvin Humphrey
Rectangular Research
http://www.rectangular.com/
More information about the kinosearch
mailing list