[KinoSearch] removing position code from scorer subclasses

Marvin Humphrey marvin at rectangular.com
Sun Jul 15 20:47:47 PDT 2007



On Jul 15, 2007, at 4:21 AM, Nathan Kurz wrote:

> I agree with you that ORScorer is going to be hard, but I think I'm
> actually positing a harder problem.  By 'generalized', I mean a phrase
> scorer that can make use of the multiple position arrays passed up by
> that ORScorer.
> The reason I want it to be generalized is not for the sake of
> PhraseScorer, but because my real goal is to write a positional scorer
> that can wrap a BooleanScorer in the same way that PhraseScorer wraps
> its subscorers (or rather what would be subscorers if it was
> generalized).

Mmm.  A wrapper class...

I'd been thinking that the positions logic would go into  
BooleanScorer itself.  But then, my goal was to give a little extra  
oomph to existing matches when positions occurred near each other, in  
order to improve relevancy.  You want to do something different.

> Multiple fields does complicate things a bit, but not much (I think):
> just associate the field used with the positional array.

That's what ScoreProx is for:

   struct kino_ScoreProx {
       KINO_SCOREPROX_VTABLE *_;
       KINO_OBJ_MEMBER_VARS;
       chy_i32_t               field_num;
       chy_u32_t               num_prox;
       chy_u32_t              *prox;
   };

> I think answer is that
> all position data is represented as an array of position arrays, with
> Term scorers just being the 1-dimensional case.

Yes.  TermScorers and PhraseScorers -- both of which are single-field  
-- keep their own ScoreProx and just populate it with new numbers for  
each match.

For TermScorer, all you have to do is copy pointers from the Posting  
object.

For PhraseScorer, you have to build up new ScoreProx data based on  
which positions actually matched.

All this is currently implemented,  though the information is not used.

[ Heh.  I just realized that PhraseScorer's implementation is buggy.   
I've added an explanatory comment to PhraseScorer.c.  ]

> Both And and Or
> scorers return a separate position array for each matching clause.

Exactly.  We'd need an array of ScoreProx objects, and a count  
indicating the number of elements in the array.  That's what's in  
Tally at this moment.

    struct kino_Tally {
       KINO_TALLY_VTABLE *_;
       KINO_OBJ_MEMBER_VARS;
       float                   score;
       chy_u32_t               num_matchers;
       chy_u32_t               sprox_cap;   /* sproxen allocation  
size */
       chy_u32_t               num_sproxen; /* number of ScoreProx */
       struct kino_ScoreProx **sproxen;     /* array of ScoreProx  
objects */
    };

>> When you're plowing through a postings file, you have to read pretty
>> much all the data.  You can't easily read the matching data and skip
>> the scoring data.  So you'll already have done most of the work by
>> the time you get to Tally.
>
> Yes, for Term scorers, but for things like a sloppy phrase scorer,
> determining how many times a phrase occurs (or almost occurs) might be
> quite expensive.

It's expensive, but the way things are now, that determination has to  
be made during the matching phase.  You have to know that the phrase  
occurs at least once in order to decide that the document matches;  
once you've started down that path, you might as well finish up and  
find all the phrase matches within the doc.

>> PhraseScorer doesn't actually have subscorers -- it handles
>> PostingList objects directly.  If we were to have PhraseScorer
>> operate over TermScorers, that would slow it down some -- hard to say
>> how much.
>
> Or worse, over ORScorers.  But this is the problem I'm proposing:  how
> can this be done efficiently?

I don't think efficiency is going to be a big problem, because the  
pathological cases (like your even/odd example) aren't common enough  
to worry about.

Vanilla PhraseScorer should remain as is, operating on PostingLists.   
Your custom scorer will operate on subscorers.

>> To guarantee such a behavior for all Posting subclasses, we'd have to
>> move the freq, prox, and num_prox member vars from ScorePosting into
>> the base class Posting.
>
> I'm not sure if it is better or worse, but I was suggesting moving
> those (well, the last two) into the base Scorer!

Posting is single-field, so it only needs prox and num_prox, while  
Scorers need to be able to pass multi-field data so they need sproxen  
and num_sproxen.  I think this information should be made available  
via the Scorer's Tally object, rather than via the Scorer struct.   
The Scorer struct's layout should remain private.

You seem very determined not to call Scorer_Tally when doing match- 
only.  There's a cost to generate the positions information, though,  
in PhraseScorer, OrScorer, etc.  How about delaying that effort until  
Tally is called?  If you don't care about scoring info, then don't  
calculate it -- just manipulate the positions.

> I'm hoping that the
> internals of Posting won't be visible at all from the component
> scorers perspective.

They wouldn't have to be visible to your high-level scorer.

>> > I'd also like it
>> > to be possible to do phrase-type matching on items other than raw
>> > terms: things like "a b|c d".
>>
>> In Lucene-land, people generally meet this requirement by injecting
>> synonyms at index time.  Implementing a search-time solution would be
>> tricky.
>
> Yeah, but that trickiness was the focus of this message.  If we want a
> position scorer that plays well with others, we would need to tackle
> it.  Should we?

Hell yeah.  I don't know whether the current PhraseScorer should be  
the base class, though.  I think you're right that we need a hybrid.

>> PhraseScorer is easy.
>>
>>    bool_t
>>    PosPhraseScorer_Skip_To(PosPhraseScorer *self, i32_t target)
>>    {
>>      /* if ($self->SUPER::skip_to($target)) */
>>      if (PhraseScorer_skip_to((PhraseScorer*)self, target)) {
>>         build_prox(self);
>>         return true;
>>      }
>>      return false;
>>    }
>
> This has what I feel is the significant disadvantage of finding the
> first occurrence of the phrase twice:  PhraseScorer_skip_to finds the
> first occurence of the phrase and returns a match, then build_prox
> does it again.

That's not the case.  The way PhraseScorer  works, you're left with  
an "anchor set" of positions at the end of the phrase matching  
process.  Each item in the anchor_set array represents one phrase.   
The static function build_prox in PhraseScorer.c just riffs off of  
that, building outward from each anchor.

>> > Scorer components (And, Or, Nor, Phrase) are directly reusable.
>>
>> Note that all of these *are* directly reusable whether the field
>> specifies a posting type of ScorePosting or RichPosting, and that the
>> only difference between the TermScorer subclasses ScorePostingScorer
>> and RichPostingScorer is a rebless!
>
> True, but I'm not finding them reuseable for positional stuff, as they
> don't make it positions available.

We may have no choice but to modify core then. :(

I would really like to avoid that if we can.  Adding to my  
maintenance burden is the wrong way to go.  I'd rather see you  
publish a CPAN distro, so you get the credit and the responsibility.

> And PhraseScorer (as well as the
> TermScorers, which I meant to include in the list) are currently
> TF/IDF specific.

Why does that matter?

>> Subclasses override Tally to provide their own scoring scheme.
>
> Are they required to subclass each TermScorer as well?  And each
> PhraseScorer?

> And what about when a custom scorer suite meets a
> custom index format?

Those are very general questions.  It's not possible to anticipate  
all the possible scoring configurations people will want to build.

> Yes, it's just a matter of figuring out how to signal this by some
> means other than a segfault.   Does PhraseScorer currently prevent
> this somewhere?

Of course.  If freq is 0, the positions array never gets dereferenced.

         u32_t *candidates      = posting->prox;
         u32_t *candidates_end  = posting->prox + posting->freq;

     /* ... */

             while (candidates < candidates_end && *candidates <  
target) {
                 candidates++;
             }

             if (candidates == candidates_end) {
                 break;
             }

> The real gain would come only if you iterate directly over the mapped
> index file without translation or buffering.   As such, a useful
> mmap-based index format would be bypassing InStream.   It couldn't use
> VInts, rather it would need to have a format that can be accessed and
> used directly as a C-struct.

This could be implemented using a custom Posting subclass.   For the  
simplest case -- doc num only -- you'd just have it write native  
integers, rather than deltas encoded as vints.   The size of such a  
postings file would probably increase by a factor of between 3 and 4.

Also note that these postings "files" are actually sub-files within a  
compound file.  That complicates things a bit, but it's probably not  
an insurmountable problem.

> Among the downsides are that the index files would be byte-order
> dependent, and as far as I know it won't work on Windows.

File format portability isn't a major issue with search indexes.

Sheer index size might be a problem.

I'll be a little concerned about bypassing InStream until I see an  
actual implementation.

InStream can work off of a RAMFolder, which stores data using a  
ragged array technique.  That won't be compatible with this.

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/



_______________________________________________
KinoSearch mailing list
KinoSearch at rectangular.com
http://www.rectangular.com/mailman/listinfo/kinosearch




More information about the kinosearch mailing list