[KinoSearch] passing positions

Marvin Humphrey marvin at rectangular.com
Fri Sep 7 00:53:25 PDT 2007




On Sep 6, 2007, at 12:17 AM, Nathan Kurz wrote:

>> One of your inventions is Scorer_Advance.  I like it as a substitute
>> for Scorer_Next, and it might be worth a global search and replace
>> since that method isn't public yet. :)  However, in your code it
>> appears to be a substitute for Scorer_Skip_To.
>
> I'm hoping to collapse those two down to a single function.

That would be very nice.  I tried to pull that off, but I ran into  
some problem.  I don't remember what it was, though.  :(
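
For reference, a minimal sketch of what a merged method could look like, in Python for brevity rather than KS's C.  The class, its fields, and the "always move at least one doc" convention are assumptions made for illustration, not the actual Scorer interface:

```python
class TermScorer:
    """Hypothetical scorer over a sorted list of doc numbers (not the KS API)."""

    def __init__(self, doc_nums):
        self.doc_nums = sorted(doc_nums)
        self.tick = -1  # index into doc_nums; -1 means "not started yet"

    def doc(self):
        return self.doc_nums[self.tick]

    def advance(self, target=0):
        """Move past the current doc to the first doc >= target.

        advance() with no argument behaves like Next; advance(n) behaves
        like Skip_To.  Returns False once the scorer is exhausted.
        """
        self.tick += 1
        while self.tick < len(self.doc_nums) and self.doc_nums[self.tick] < target:
            self.tick += 1
        return self.tick < len(self.doc_nums)
```

Under this convention a caller never needs to know which of the two old methods it wants; a target of 0 is simply "no constraint".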

> Yes, I think that PhraseScorer should use a subscorer and not
> PostingLists.
> That said, it may be simpler to restrict complexity of that subscorer
> at least temporarily so that we don't have to start with a fully
> recursive phrase scorer.
>
> Something like allowing:
> PhraseScorer -> AndScorer -> [TermScorer TermScorer TermScorer]

Good plan.  Can we do that now, in isolation from the rest of the  
changes?
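
A rough sketch of that restricted arrangement, with a PhraseScorer wrapping an AndScorer over flat TermScorers.  It is in Python for brevity; the in-memory postings dict, the method names, and the advance() convention (always move at least one doc) are assumptions for illustration, not KS code:

```python
class TermScorer:
    def __init__(self, postings):
        # postings: {doc_num: [positions]}, a hypothetical in-memory stand-in
        self.postings = postings
        self.docs = sorted(postings)
        self.tick = -1

    def doc(self):
        # -1 means "not started yet"
        return self.docs[self.tick] if self.tick >= 0 else -1

    def positions(self):
        return self.postings[self.doc()]

    def advance(self, target=0):
        # Move past the current doc to the first doc >= target.
        self.tick += 1
        while self.tick < len(self.docs) and self.docs[self.tick] < target:
            self.tick += 1
        return self.tick < len(self.docs)


class AndScorer:
    def __init__(self, children):
        self.children = children

    def advance(self, target=0):
        first = self.children[0]
        if not first.advance(target):
            return False
        while True:
            target = first.doc()
            agreed = True
            for child in self.children[1:]:
                if child.doc() < target and not child.advance(target):
                    return False
                if child.doc() > target:
                    # Overshot: pull the first child up and start over.
                    if not first.advance(child.doc()):
                        return False
                    agreed = False
                    break
            if agreed:
                return True


class PhraseScorer:
    def __init__(self, and_scorer):
        self.and_scorer = and_scorer

    def doc(self):
        return self.and_scorer.children[0].doc()

    def advance(self, target=0):
        # Accept only docs where the terms appear at consecutive positions.
        while self.and_scorer.advance(target):
            if self.phrase_starts():
                return True
        return False

    def phrase_starts(self):
        terms = self.and_scorer.children
        later = [set(t.positions()) for t in terms[1:]]
        return [p for p in terms[0].positions()
                if all(p + i + 1 in s for i, s in enumerate(later))]
```

Because the children are plain TermScorers, the phrase check can read their positions directly; nothing here handles a nested subscorer.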

>> I think similar reasoning led you to Match and me to Tally.
>
> Well, that and the hope that if I paralleled Match and Tally you'd
> like the idea better :).

Heh.

>>> The trickiness (and I don't like trickiness) is that each Match is
>>> allowed to contain either an array of positions, or an array of Match
>>> structs:
>>
>> I doubt that's necessary.  Just create a default wrapper at the
>> lowest level.  That's how TermScorer does things presently.
>
> I fear the trickiness is still necessary at some level, but I think
> I've managed to hide it in a place you'll like better.  Essentially,
> I'm going to propose two main subclasses for Scorer, MultiScorer and
> MatchScorer.  MultiScorers contain a public VArray of other Scorers,
> while MatchScorers contain a public Match struct.

Interesting. Do you end up with more subscorers than before?
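
A rough sketch of the proposed split, in Python for brevity.  The Match layout and the leaf_matches() helper are hypothetical; the thread only says that MultiScorer exposes its child Scorers and MatchScorer exposes a Match:

```python
from dataclasses import dataclass, field


@dataclass
class Match:
    # Hypothetical layout: the thread does not spell out the struct's members.
    positions: list = field(default_factory=list)


class Scorer:
    pass


class MatchScorer(Scorer):
    def __init__(self, match):
        self.match = match  # public Match struct


class MultiScorer(Scorer):
    def __init__(self, children):
        self.children = children  # public array of sub-Scorers


def leaf_matches(scorer):
    """Walk the scorer tree and yield each leaf Match as-is."""
    if isinstance(scorer, MatchScorer):
        yield scorer.match
    else:
        for child in scorer.children:
            yield from leaf_matches(child)
```

With this shape the "positions or Matches" choice moves out of the Match struct and into the type of the Scorer that holds it.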

>> This variable name violates my "avoid overload overload" rule. :)
>> "field" has a very specific meaning in the context of KS and this
>> isn't it.
>
> I agree with you in general, but I thought this was the specific
> meaning.  It's removed from Match in my new incarnation, but would
> you prefer it to be called 'index_field' or 'field_num'?

field_num.

"field", when it's used at all, means "field name".  It used to mean  
a Field object -- before I killed that class -- and that's still the  
place it holds in concept-space.

>> This was the driving factor behind the ScoreProx class.
>
> I've forgotten the details, but I came to the conclusion that
> ScoreProx was at odds with Rich Positions, and that to allow a
> Proximity-type scorer to use Positions-specific weights, some wider
> interface was needed.

I'm having trouble visualizing this.  I wish there was a way to  
divide and conquer this problem more effectively.

>> Collation of positions gets complicated when these scorers are nested.
>
> It's possible we are defining terms differently here, but my current
> plan is that there never will be any collation.  Instead, the
> MultiScorers (AndScorer, OrScorer) will allow their children's Match
> structs to be accessed directly.

I think you need collation for the PhraseScorer.  Say you're
iterating over positions in several subscorers.  You have positions 35
and 36; now you need 37.  If you haven't kept track of where each
subscorer is at, you'll have to start from scratch with each one.
If you don't start from scratch, and the subscorer has multiple
subscorers itself, you might miss something.
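
To make the bookkeeping concrete, here is a minimal sketch for the flat case only, in Python for brevity.  The function name and the plain-list input are assumptions for illustration; it keeps one cursor per term so that no position list is rescanned from the start:

```python
def phrase_positions(position_lists):
    """Return start positions where term i occurs at start + i for every i.

    position_lists holds one ascending list of positions per phrase term.
    cursors[i] remembers how far list i has been consumed.
    """
    cursors = [0] * len(position_lists)
    starts = []
    for start in position_lists[0]:
        matched = True
        for i in range(1, len(position_lists)):
            plist = position_lists[i]
            wanted = start + i
            # Resume from the saved cursor: anything below `wanted` cannot
            # match this start or any later one, so it is safe to skip.
            while cursors[i] < len(plist) and plist[cursors[i]] < wanted:
                cursors[i] += 1
            if cursors[i] == len(plist) or plist[cursors[i]] != wanted:
                matched = False
                break
        if matched:
            starts.append(start)
    return starts
```

This works because each term contributes a single ascending list.  Once a subscorer has subscorers of its own, there is no single list to hold a cursor into, which is where the trouble described above begins.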

> I tried to pursue collation at one
> point, and gave up: positions from multiple fields, phrases of
> different lengths.

Yes, this is the same problem that thwarted me in my first go.

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/



_______________________________________________
KinoSearch mailing list
KinoSearch at rectangular.com
http://www.rectangular.com/mailman/listinfo/kinosearch



