[KinoSearch] removing position code from scorer subclasses
Marvin Humphrey
marvin at rectangular.com
Sun Jul 15 00:12:12 PDT 2007
Nathan,
We're pretty much on the same page. Some of what you suggest is
already implemented. :)
On Jul 14, 2007, at 5:15 PM, Nathan Kurz wrote:
> As I mentioned in a previous message, I think that PhraseScorer is a
> good example to think about. If we come up with a solution that
> allows for a generalized and efficient phrase scorer, I think we'll be
> on the right path.
I think ORScorer is likely to be the hardest. ANDScorer and
ANDORScorer aren't going to be much easier.
PhraseScorer deals with multiple PostingLists, but all of them belong
to the same field, so their intersection is a single array. In
contrast, the BooleanScorer components have to resolve/remember
multiple positions arrays belonging to different fields -- which is
harder.
> 2. Matching is done for every possible document, thus should be
> optimizable.
Searches like 'rare_term AND common_term' and '"rare_term
common_term"' are already optimized, and do not perform matching for
every document, thanks to the fact that all the Scorer subclasses in
KS 0.20_04 properly implement Scorer_Skip_To. (SegPList_Skip_To is
where the actual optimization takes place.)
> 3. The scoring stage only occurs for documents that match, so can
> be slower.
Maybe it can be, but it probably won't be.
When you're plowing through a postings file, you have to read pretty
much all the data. You can't easily read the matching data and skip
the scoring data. So you'll already have done most of the work by
the time you get to Tally. Take a look at ScorePost_Read_Record,
which you need for matching, and ScorePostScorer_Tally, which isn't
very costly.
I think worrying about the cost of calling Scorer_Tally() may be
premature optimization.
> You are proposing that the position information (when needed) be
> passed along as part of the Tally. This creates problems for scorers
> like PhraseScorer, which need[*] to get positions from its subscorers
> before determining if the document is a match.
PhraseScorer doesn't actually have subscorers -- it handles
PostingList objects directly. If we were to have PhraseScorer
operate over TermScorers, that would slow it down some -- hard to say
how much.
> The current version of PhraseScorer sidesteps this by working directly
> off the position data in the Posting. While I like the efficiency of
> this approach, I'm worried it is not going to extend well to custom
> index formats. In particular, I don't want the author of a new index
> format to be required to write a CustomPhraseScorer.
So long as the Posting subclass can supply an array of positions and
a freq, PhraseScorer_Next will work fine.
To guarantee such a behavior for all Posting subclasses, we'd have to
move the freq, prox, and num_prox member vars from ScorePosting into
the base class Posting. I have some misgivings about that because
the bigger Posting is, the less efficient various implementations of
Post_Bulk_Read will be. Maybe we can figure out a better way to
handle things there. :)
> I'd also like it
> to be possible to do phrase-type matching on items other than raw
> terms: things like "a b|c d".
In Lucene-land, people generally meet this requirement by injecting
synonyms at index time. Implementing a search-time solution would be
tricky.
> I think there needs to be a way to signal to a Scorer that its parent
> wants this Positions to be set.
Ugh, I'm not a fan of flags like that. You wind up with a bunch of
extra conditional code everywhere.
http://www.refactoring.com/catalog/
replaceConditionalWithPolymorphism.html
> You suggested that this be done by
> subclasses instead, but I don't see this working well. First, it
> would double the number of Scorer classes, or worse, it wouldn't
> double them and if you wanted a position scorer you'd have to subclass
> the existing scorers yourself. Second, I think that each
> position-passing subclass is going to end up duplicating most of the
> code within its 'parent' class (efficient but bug-prone), or doing the
> work twice (think of a PhraseScorer that has already found a phrase).
PhraseScorer is easy.
bool_t
PosPhraseScorer_Skip_To(PosPhraseScorer *self, i32_t target)
{
/* if ($self->SUPER::skip_to($target)) */
if (PhraseScorer_skip_to((PhraseScorer*)self, target)) {
build_prox(self);
return true;
}
return false;
}
> I think the path between these rocks is going to involve conflating
> some of the notions of Posting and Scorer, in particular making
> position information available from a Scorer in the way analagous to
> that which PhraseScorer currently grabs it directly from
> Postings. Rather than having the position data within the Tally, I
> think it needs to be part of the base Scorer class, so that it can
> always be accessed as directly as Scorer->positions.
This is fairly close to how things are now. Tally objects "belong"
to a particular Scorer.
> Thus here's my view of the end goal of the Scorer classes:
>
> Scorer components (And, Or, Nor, Phrase) are directly reusable.
Note that all of these *are* directly reusable whether the field
specifies a posting type of ScorePosting or RichPosting, and that the
only difference between the TermScorer subclasses ScorePostingScorer
and RichPostingScorer is a rebless!
> 1. be used for Match-only (simple -- don't call Tally)
Or, make calling Tally so cheap that it doesn't matter:
BoolPostingScorer_Tally(BoolPostingScorer *self)
{
return self->tally;
}
> 2. be used for Scoring (great --- call Tally on subscorers as
> necessary)
> 3. Make positions of matches available to parent scorer prior to
> Tally
>
> Custom scoring algorithms can layer on top of standard components.
> 1. Base components do not presume any particular scoring scheme.
Implementation:
Tally*
Scorer_tally(Scorer *self)
{
ABSTRACT_DEATH(self, "Tally");
UNREACHABLE_RETURN(Tally*);
}
Subclasses override Tally to provide their own scoring scheme.
> 2. Scorer can signal to its subscorers that it needs position data.
I think a better way to handle this is to make the presence or
absence of position data a property of the field, via FieldSpec-
>posting_type.
As originally envisioned, "MatchPosting" would not provide any
positional data. A MatchPosting file is just a stack of doc nums.
(Actually, compressed delta doc nums).
<doc_num>+
Perhaps we should rename "MatchPosting" to "BoolPosting", and have KS
throw an error if you try to create a phrase-match against a field
with that posting_type.
We could modify MatchPosting to include freq and position data, which
would make phrase matching possible.
<doc_num, freq, <position>+>+
That's perilously close to the current implementation of
ScorePosting, though:
<doc_num, freq, shared_boost, <position>+>+
The only difference is a single byte per document for the shared boost.
> 3. Scorers fail gracefully if required data not in the index.
The way to do this is to indicate that the document matches, but
there aren't any positions. The position loop code will never be
entered.
> Custom index formats require only a single custom term scorer.
> 1. This term scorer is the sole interface that a search touches.
> 2. Scorers don't need to be aware of underlying index format.
Yes. :) That's how things are working right now.
However, position-aware Scorers are fundamentally different from
ordinary scorers, and probably should be distinct classes.
> 3. A speed-for-size trading mmap'ed index format should be possible.
To be frank, my experience with mmap is not extensive. FWIW, there's
a Lucene class called MMapDirectory. (Directory => Folder in KS.)
It's gotten mixed reviews.
I don't like it because porting it would require that InStream become
subclassable. Right now, InStream is a "final" class, which means
that all of its method macros are just aliases for the actual functions:
/* (from InStream.r, generated by Boilerplater) */
#define Kino_InStream_Read_VInt(self) \
kino_InStream_read_vint((kino_InStream*)self)
If we have to allow both buffering and non-buffering versions of
InStream, then I need to make InStream's methods overrideable, which
means a double dereference:
#define Kino_InStream_Read_VInt(self) \
(self)->_->read_vint((kino_InStream*)self)
I rather like the simplicity of having only one InStream.
> How do these strike you as goals? These make sense to me, but perhaps
> I'm confusing a particular implementation with a general need.
> Alternatively, I'm willing to just remove the position data as you
> suggest, but I think it might produce a better end result if I better
> understood where we were headed.
This position data was never used, because I never wrote the
implementation for BooleanScorer.
Marvin Humphrey
Rectangular Research
http://www.rectangular.com/
More information about the kinosearch
mailing list