[KinoSearch] Re: Subclassable Highlighter (was: Re: KinoSearch feature suggestions)
Marvin Humphrey
marvin at rectangular.com
Fri Jan 25 02:11:28 PST 2008
On Jan 24, 2008, at 9:45 AM, Father Chrysostomos wrote:
> I’d certainly like to avoid copying and pasting the code for
> calculating the best location and for ‘rounding’ the ends to the
> nearest sentence.
Okeedoke.
> What would you suggest (or dictate, since you’re in charge :-) that
> the methods be?
Well, let's brainstorm.
Right now, we're using Query->extract_terms as the raw input to the
Highlighter. There's a nasty, undocumented kludge in there:
PhraseQuery returns an arrayref, while everything else returns an
array, allowing Highlighter to differentiate between terms that
should only match if they're in a phrase and terms that should match
everywhere. Then Highlighter basically duplicates the phrase
matching logic of PhraseScorer and then conflates all phrase-matching
positions with all other positions.
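
To make the kludge concrete, here is a minimal sketch of the kind of
check a consumer has to perform. The list contents are made up, and
plain strings stand in for whatever term objects extract_terms really
hands back; the only point is the ref() test on each item.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the extract_terms output: a plain scalar is
# a standalone term, an arrayref holds the terms of a single phrase.
my @extracted = ( 'foo', [ 'bar', 'baz' ], 'quux' );

my ( @loose_terms, @phrases );
for my $item (@extracted) {
    if ( ref($item) eq 'ARRAY' ) {
        push @phrases, $item;        # should only match as a phrase
    }
    else {
        push @loose_terms, $item;    # should match everywhere
    }
}

printf "loose: %s\n", join ', ', @loose_terms;
printf "phrase: %s\n", join ' ', @$_ for @phrases;
```

Anything that consumes the list has to know about this convention,
which is why it counts as a kludge.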
This is not an extensible approach. Highlighter would need to be
modified to duplicate the matching logic of any arbitrary
Query/Scorer regime in order to add its positions.
The only way to acquire highlight data extensibly is to go back up
the chain to Query.
    my $highlight_data = $query->highlight_data($doc_vector, $field_name);
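
A rough sketch of how a Query subclass might answer that call, just to
show where the matching logic would live. My::TermQuery and the
hash-based $doc_vector layout are both invented for illustration and
are not KinoSearch classes or structures; the return value here is the
barest possible one, a list of start offsets.

```perl
use strict;
use warnings;

package My::TermQuery;    # hypothetical; not a real KinoSearch class

sub new {
    my ( $class, %args ) = @_;
    return bless { field => $args{field}, term => $args{term} }, $class;
}

# Assumption: $doc_vector looks like { field => { term => [ offsets ] } }.
sub highlight_data {
    my ( $self, $doc_vector, $field_name ) = @_;
    return [] unless $field_name eq $self->{field};
    my $terms = $doc_vector->{$field_name} || {};
    return [ @{ $terms->{ $self->{term} } || [] } ];    # copy, not alias
}

package main;

my $doc_vector = { content => { kino => [ 0, 20 ] } };
my $query = My::TermQuery->new( field => 'content', term => 'kino' );
my $highlight_data = $query->highlight_data( $doc_vector, 'content' );
print "match at offset $_\n" for @$highlight_data;
```

A phrase query would implement the same method with its own matching
rules, and Highlighter would never need to know the difference.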
In its most basic form, the highlight data could be an array of
positions. However, I think it ought to be something richer -- an
array of HighlightSpan objects.
    my $highlight_span = KinoSearch::Highlight::HighlightSpan->new(
        start_offset => 0,
        end_offset   => 16,
        weight       => 3.0
    );
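
For the sake of discussion, such a class could be as small as the
sketch below. The package name My::HighlightSpan, the accessor names,
and the validation rules are assumptions of mine; only the three
constructor parameters come from the proposal above.

```perl
use strict;
use warnings;

package My::HighlightSpan;    # hypothetical stand-in for the proposed class

sub new {
    my ( $class, %args ) = @_;
    for my $key (qw( start_offset end_offset weight )) {
        die "Missing required param '$key'" unless defined $args{$key};
    }
    die "end_offset precedes start_offset"
        if $args{end_offset} < $args{start_offset};
    return bless { %args }, $class;
}

# Read-only accessors (names are an assumption).
sub get_start_offset { $_[0]{start_offset} }
sub get_end_offset   { $_[0]{end_offset} }
sub get_weight       { $_[0]{weight} }

package main;

my $span = My::HighlightSpan->new(
    start_offset => 0,
    end_offset   => 16,
    weight       => 3.0,
);
printf "%d-%d weight %.1f\n",
    $span->get_start_offset, $span->get_end_offset, $span->get_weight;
```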
Highlighter can offer a public method, heat_map(), which takes an
array of HighlightSpan objects as input, and returns a
KinoSearch::Highlight::HeatMap object. This object would serve as a
vessel for the kind of information currently conveyed via
_starts_and_ends and _calc_best_location. In theory, a HeatMap
object might supply an array of floats, one per character in the
field; in practice, we'll need to dial that back.
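
Here is a sketch of the "in theory" version, one float per character.
Spans are plain hashes rather than objects to keep it short, heat_map
is a bare function rather than a Highlighter method, and treating
end_offset as exclusive is an assumption.

```perl
use strict;
use warnings;

# Sum span weights into a per-character array.
# Assumption: each span covers [start_offset, end_offset).
sub heat_map {
    my ( $field_length, @spans ) = @_;
    my @heat = (0) x $field_length;
    for my $span (@spans) {
        my $end = $span->{end_offset};
        $end = $field_length if $end > $field_length;    # clamp to field
        for my $i ( $span->{start_offset} .. $end - 1 ) {
            $heat[$i] += $span->{weight};
        }
    }
    return \@heat;
}

my $heat = heat_map(
    10,
    { start_offset => 0, end_offset => 4, weight => 3.0 },
    { start_offset => 2, end_offset => 6, weight => 1.0 },
);
print join( ' ', @$heat ), "\n";
```

Overlapping spans add up, so characters 2 and 3 end up hotter than
their neighbors. For a long field this array is wasteful, which is the
reason to dial it back.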
The default Highlighter would use the HeatMap to find a single
contiguous snippet. Your subclass would use it to find multiple
snippets.
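
One way the single-snippet case could work, sketched as a sliding
window over a per-character heat array. The function name, the
fixed-width window, and the earliest-wins tie-breaking are all
assumptions; this is not what _calc_best_location does today.

```perl
use strict;
use warnings;

# Return the start of the $width-character window with the greatest
# summed heat. Ties go to the earliest window.
sub best_window_start {
    my ( $heat, $width ) = @_;
    my $len = scalar @$heat;
    return 0 if $len <= $width;

    my $sum = 0;
    $sum += $heat->[$_] for 0 .. $width - 1;
    my ( $best_sum, $best_start ) = ( $sum, 0 );

    for my $start ( 1 .. $len - $width ) {
        # Slide right by one: add the entering char, drop the leaving one.
        $sum += $heat->[ $start + $width - 1 ] - $heat->[ $start - 1 ];
        ( $best_sum, $best_start ) = ( $sum, $start ) if $sum > $best_sum;
    }
    return $best_start;
}

my @heat  = ( 0, 0, 1, 0, 0, 3, 4, 3, 0, 0 );
my $start = best_window_start( \@heat, 3 );
print "best window starts at $start\n";
```

A multiple-snippet subclass could run the same scan repeatedly,
zeroing out each chosen window before looking for the next.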
As for the "rounding the ends" code... maybe a method called
find_sentence_boundaries? generate_excerpts() can then make use of
the boundary information however it sees fit.
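
A naive sketch of what that might return: the offsets at which
sentences begin. The punctuation rule is an assumption and will trip
over abbreviations like "e.g."; round_to_sentence_start is a
hypothetical helper showing how an excerpt generator might use the
result.

```perl
use strict;
use warnings;

# Sentence starts: offset 0, plus the first non-space character after
# any '.', '!' or '?' that is followed by whitespace.
sub find_sentence_boundaries {
    my ($text) = @_;
    my @starts = (0);
    while ( $text =~ /[.!?]\s+(?=\S)/g ) {
        push @starts, pos($text);
    }
    return \@starts;
}

# Round a raw offset back to the closest sentence start at or before it.
sub round_to_sentence_start {
    my ( $starts, $offset ) = @_;
    my $rounded = 0;
    for my $start (@$starts) {
        last if $start > $offset;
        $rounded = $start;
    }
    return $rounded;
}

my $text    = "First one. Second one! Third?";
my $starts  = find_sentence_boundaries($text);
my $rounded = round_to_sentence_start( $starts, 15 );
print "sentence starts: @$starts\n";
print "offset 15 rounds back to $rounded\n";
```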
Marvin Humphrey
Rectangular Research
http://www.rectangular.com/