[KinoSearch] removing position code from scorer subclasses

Nathan Kurz nate at verse.com
Sun Jul 15 23:29:30 PDT 2007


On 7/15/07, Marvin Humphrey <marvin at rectangular.com> wrote:
> > Multiple fields does complicate things a bit, but not much (I think):
> > just associate the field used with the positional array.
>
> That's what ScoreProx is for:
>
>    struct kino_ScoreProx {
>        KINO_SCOREPROX_VTABLE *_;
>        KINO_OBJ_MEMBER_VARS;
>        chy_i32_t               field_num;
>        chy_u32_t               num_prox;
>        chy_u32_t              *prox;
>    };

Yes, it's mostly there. I  fear that for generality we  also need:
            chy_u32_t *lengths;
If null, the lengths are all presumed as one, but if present it has
the same length as prox and each entry represents the length of that
occurence.
Sometimes I can convince myself that you could get away with a single
length entry, but suppose you are trying to do a phrase match on:
'a b|"c d" e'.   Ugly, but if we want generality I think we need per
position length.   You mentioned something about planning for this?

> Exactly.  We'd need an array of ScoreProx objects, and a count
> indicating the number of elements in the array.  That's what's in
> Tally at this moment.
>
>     struct kino_Tally {
>        KINO_TALLY_VTABLE *_;
>        KINO_OBJ_MEMBER_VARS;
>        float                   score;
>        chy_u32_t               num_matchers;
>        chy_u32_t               sprox_cap;   /* sproxen allocation
> size */
>        chy_u32_t               num_sproxen; /* number of ScoreProx */
>        struct kino_ScoreProx **sproxen;     /* array of ScoreProx
> objects */
>     };

Yup, just about exactly that.  The 'Positions' object in the stuff I
quoted was intended to be an encapsulation of the last three fields
here.  I'm not sure about the name, but I think it would be good to
factor it out.


> [ Heh.  I just realized that PhraseScorer's implementation is buggy.
> I've added an explanatory comment to PhraseScorer.c.  ]

Since you bring it up, I think there's another bug in that loop: the
phrase_offsets are ignored and the phrase is presumed to be
contiguous.   But  it seemed churlish to report a bug in unused
section I wasn't sure I understood when I was about to take out
anyway.  Since it's not being used, I wasn't sure if it was supposed
to be that way:  is it?

> I think this information should be made available
> via the Scorer's Tally object, rather than via the Scorer struct.
> The Scorer struct's layout should remain private.

I do like the way that Tally currently works, but I'm not sure that
exposing a data API based on the struct layout is a bad thing,
particularly as we are already depending on a rigid layout to make
BoilerPlater work.

> You seem very determined not to call Scorer_Tally when doing match-
> only.  There's a cost to generate the positions information, though,
> in PhraseScorer, OrScorer, etc.  How about delaying that effort until
> Tally is called?

Certainly the goal is to delay it as  long as possible in the hope
that one won't have to do it at all, but I think there are times
(Position scoring a PhraseScorer) when you are either going to need to
Tally the subscorers during the parent's matching phase (as OrScorer
does now) else come up with a different way of accessing position
information.  You may be right that there is no need to force all the
tallying to occur after all the matching, though.  I like that rule
for conceptual clarity, but there's no implementation reason it should
be that way.

> >> > terms: things like "a b|c d".
> >>
> > Yeah, but that trickiness was the focus of this message.  If we want a
> > position scorer that plays well with others, we would need to tackle
> > it.  Should we?
>
> Hell yeah.

Great!  I think if we do end up with an architecture that enables that
level of generality, there will be a lot of interesting things one can
do with this.

On  a slightly related note regarding the potential target audience,
did you happen to see this post from Facebook on why they wrote own
search engine:  http://blog.facebook.com/blog.php?post=2535632130
It's similar to some of scoring mechanisms I'm thinking about.


> That's not the case.  The way PhraseScorer  works, you're left with
> an "anchor set" of positions at the end of the phrase matching
> process.

My fault.  I'd read through that before, but just remembered it all wrong.

> I would really like to avoid that if we can.  Adding to my
> maintenance burden is the wrong way to go.

Au contraire!  From my position, it's definitely the way to go!  Is
there some reason it doesn't seem like a perfect solution to you?  :)

> > And PhraseScorer (as well as the
> > TermScorers, which I meant to include in the list) are currently
> > TF/IDF specific.
>
> Why does that matter?

Because they don't (in my limited experience) make good base classes
for scorers using other schemes, thus one ends up rewriting them
rather than subclassing them.  But it may just be me not using them
properly.

> Those are very general questions.  It's not possible to anticipate
> all the possible scoring configurations people will want to build.

No, but we can try to anticipate the likely configurations:  TF/IDF,
match-only, and position/scope, for starters.

> > Yes, it's just a matter of figuring out how to signal this by some
> > means other than a segfault.   Does PhraseScorer currently prevent
> > this somewhere?
>
> Of course.  If freq is 0, the positions array never gets dereferenced.

I was more worried that the explicit cast to ScorePosting might cause
posting->freq and posting->prox to point to something unrelated if a
custom Posting class  inherits directly from Posting rather than from
ScorePosting.  Is this case caught somewhere?

>> (mmap)
> The size of such a
> postings file would probably increase by a factor of between 3 and 4.

Yes, there's definitely a size trade-off.   If the active portion of
your expanded dataset is larger than available RAM and you start
paging from disk per search, the advantage goes away.  But if it does
fit (or can be split across enough machines so it does) you end up
with something faster than you could achieve otherwise, with no
startup penalty, and perfect interprocess sharing.

> I'll be a little concerned about bypassing InStream until I see an
> actual implementation.

Certainly, and you should be.  While this might be a useful custom
format, it's not going to be able to replace the existing formats in
most cases.

Thanks for your responses.  I'll try to send you a
position-code-removing patch in the next week.

Nathan Kurz
nate at verse.com



More information about the KinoSearch mailing list