[KinoSearch] removing position code from scorer subclasses

Nathan Kurz nate at verse.com
Sun Jul 15 04:21:13 PDT 2007


On 7/15/07, Marvin Humphrey <marvin at rectangular.com> wrote:
> > As I mentioned in a previous message, I think that PhraseScorer is a
> > good example to think about.  If we come up with a solution that
> > allows for a generalized and efficient phrase scorer, I think we'll be
> > on the right path.
>
> I think ORScorer is likely to be the hardest.  ANDScorer and
> ANDORScorer aren't going to be much easier.

I agree with you that ORScorer is going to be hard, but I think I'm
actually positing a harder problem.  By 'generalized', I mean a phrase
scorer that can make use of the multiple position arrays passed up by
that ORScorer.
The reason I want it to be generalized is not for the sake of
PhraseScorer, but because my real goal is to write a positional scorer
that can wrap a BooleanScorer in the same way that PhraseScorer wraps
its subscorers (or rather what would be subscorers if it was
generalized).

Multiple fields does complicate things a bit, but not much (I think):
just associate the field used with the positional array.  Dealing with
multiple position arrays at all is going to be the complicated part,
whether they are single or multiple fields.  I think answer is that
all position data is represented as an array of position arrays, with
Term scorers just being the 1-dimensional case.  Both And and Or
scorers return a separate position array for each matching clause.

>> 3. The scoring stage only occurs for documents that match, so can
>> be slower.
>
> Maybe it can be, but it probably won't be.

Maybe I should define what I mean by 'optimizeable' and 'slower'.  To
me, optimizeable C code means that I want to be able to avoid function
calls, think about L2 cache optimization, and generally pick apart the
assembly code to see what the compiler produced.  Slower means using a
smart algorithm, but not worrying much about things like function call
overhead.  So while it will be slower, it hopefully will not be slow.

> I think worrying about the cost of calling Scorer_Tally() may be
> premature optimization.

You may be quite right that due to the optimization that happens
during matching phase, there isn't any great difference between the
matching and scoring phases.   But I would argue that there is a
difference between designing an architecture that makes the
optimization possible (which strikes me as prudent) and actually
performing that optimization (which I agree is premature).   But I
certainly could be silly wrong about that.

> When you're plowing through a postings file, you have to read pretty
> much all the data.  You can't easily read the matching data and skip
> the scoring data.  So you'll already have done most of the work by
> the time you get to Tally.

Yes, for Term scorers, but for things like a sloppy phrase scorer,
determining how many times a phrase occurs (or almost occurs) might be
quite expensive.

> PhraseScorer doesn't actually have subscorers -- it handles
> PostingList objects directly.  If we were to have PhraseScorer
> operate over TermScorers, that would slow it down some -- hard to say
> how much.

Or worse, over ORScorers.  But this is the problem I'm proposing:  how
can this be done efficiently?  The right answer might be to deny that
the question matters.  I can definitely do without a general solution
for my purposes, but if I do that, any solution I come up with won't
be that useful to anyone else.

> To guarantee such a behavior for all Posting subclasses, we'd have to
> move the freq, prox, and num_prox member vars from ScorePosting into
> the base class Posting.

I'm not sure if it is better or worse, but I was suggesting moving
those (well, the last two) into the base Scorer!  I'm hoping that the
internals of Posting won't be visible at all from the component
scorers perspective.

> > I'd also like it
> > to be possible to do phrase-type matching on items other than raw
> > terms: things like "a b|c d".
>
> In Lucene-land, people generally meet this requirement by injecting
> synonyms at index time.  Implementing a search-time solution would be
> tricky.

Yeah, but that trickiness was the focus of this message.  If we want a
position scorer that plays well with others, we would need to tackle
it.  Should we?

>>  I think that each
>>  position-passing subclass is going to end up duplicating most of the
>>  code within its 'parent' class (efficient but bug-prone), or doing the
>>  work twice (think of a PhraseScorer that has already found a phrase).
>
> PhraseScorer is easy.
>
>    bool_t
>    PosPhraseScorer_Skip_To(PosPhraseScorer *self, i32_t target)
>    {
>      /* if ($self->SUPER::skip_to($target)) */
>      if (PhraseScorer_skip_to((PhraseScorer*)self, target)) {
>         build_prox(self);
>         return true;
>      }
>      return false;
>    }

This has what I feel is the significant disadvantage of finding the
first occurrence of the phrase twice:  PhraseScorer_skip_to finds the
first occurence of the phrase and returns a match, then build_prox
does it again.
Perhaps this is acceptable, though.

> > Scorer components (And, Or, Nor, Phrase) are directly reusable.
>
> Note that all of these *are* directly reusable whether the field
> specifies a posting type of ScorePosting or RichPosting, and that the
> only difference between the TermScorer subclasses ScorePostingScorer
> and RichPostingScorer is a rebless!

True, but I'm not finding them reuseable for positional stuff, as they
don't make it positions available.  And PhraseScorer (as well as the
TermScorers, which I meant to include in the list) are currently
TF/IDF specific.

> >   1. Base components do not presume any particular scoring scheme.
>
> Implementation:
>
>    Tally*
>    Scorer_tally(Scorer *self)
>    {
>        ABSTRACT_DEATH(self, "Tally");
>        UNREACHABLE_RETURN(Tally*);
>    }
>
> Subclasses override Tally to provide their own scoring scheme.

Are they required to subclass each TermScorer as well?  And each
PhraseScorer?  And what about when a custom scorer suite meets a
custom index format?  And can God create a rock so large that even he
cannot lift it?  OK, skip that last question. :)


> We could modify MatchPosting to include freq and position data, which
> would make phrase matching possible.

MatchPosting is just fine the way it is.  If it wasn't, the way it
was, I'd have to write it!

> >   3. Scorers fail gracefully if required data not in the index.
>
> The way to do this is to indicate that the document matches, but
> there aren't any positions.  The position loop code will never be
> entered.

Yes, it's just a matter of figuring out how to signal this by some
means other than a segfault.   Does PhraseScorer currently prevent
this somewhere?
I confess I didn't know about the FieldSpec->posting_type you mentioned.

> > Custom index formats require only a single custom term scorer.
> >   1. This term scorer is the sole interface that a search touches.
> >   2. Scorers don't need to be aware of underlying index format.
>
> Yes.  :)  That's how things are working right now.

Well, sort of.  PhraseScorer currently uses the Posting directly,
instead of via a Scorer.  I was suggesting that it should go through
the scorer as well.

> >   3. A speed-for-size trading mmap'ed index format should be possible.
>
> To be frank, my experience with mmap is not extensive.  FWIW, there's
> a Lucene class called MMapDirectory. (Directory => Folder in KS.)
> It's gotten mixed reviews.
>
> I don't like it because porting it would require that InStream become
> subclassable.  Right now, InStream is a "final" class, which means
> that all of its method macros are just aliases for the actual functions:

mmap() provides a way to directly use the system pager to handle
caching and pool filling very efficiently and easily.  Being system
level, it also means that cache is automatically shared across
processes.  It provides no real advantage is used simply to replace
File IO.

The real gain would come only if you iterate directly over the mapped
index file without translation or buffering.   As such, a useful
mmap-based index format would be bypassing InStream.   It couldn't use
VInts, rather it would need to have a format that can be accessed and
used directly as a C-struct.

Without having done the tests, I'm guessing it's probably worth at
least a 2x speedup (probably more), with better shared cache
performance as well.
Among the downsides are that the index files would be byte-order
dependent, and as far as I know it won't work on Windows.

In my somewhat considered opinion, it's as close to a silver bullet as
exists for a problem of this sort.  While I hadn't known about
MMapDirectory, I'm pretty sure it's not doing the same thing, as I
don't think it's even possible to mmap a struct in Java.   It's really
only makes sense for languages that make their internals fully
accessible, but in this case it's certainly worth exploring further.

Nathan Kurz
nate at verse.com



More information about the kinosearch mailing list