[KinoSearch] Re: Subclassable Highlighter (was: Re: KinoSearch feature suggestions)

Marvin Humphrey marvin at rectangular.com
Fri Jan 25 02:11:28 PST 2008




On Jan 24, 2008, at 9:45 AM, Father Chrysostomos wrote:

> I’d certainly like to avoid copying and pasting the code for  
> calculating the best location and for ‘rounding’ the ends to the  
> nearest sentence.

Okeedoke.

> What would you suggest (or dictate, since you’re in charge :-) that  
> the methods be?

Well, let's brainstorm.

Right now, we're using Query->extract_terms as the raw input to the  
Highlighter.  There's a nasty, undocumented kludge in there:  
PhraseQuery returns an arrayref, while everything else returns an  
array, allowing Highlighter to differentiate between terms that  
should only match if they're in a phrase and terms that should match  
everywhere.  Then Highlighter basically duplicates the phrase  
matching logic of PhraseScorer and then conflates all phrase-matching  
positions with all other positions.

This is not an extensible approach.  Highlighter would need to be  
modified to duplicate the matching logic of any arbitrary Query/ 
Scorer regime in order to add its positions.

The only way to acquire highlight data extensibly is to go back up  
the chain to Query.

    my $highlight_data = $query->highlight_data($doc_vector, $field_name);

In its most basic form, the highlight data could be an array of  
positions.  However, I think it ought to be something richer -- an  
array of HighlightSpan objects.

    my $highlight_span = KinoSearch::Highlight::HighlightSpan->new(
        start_offset => 0,
        end_offset   => 16,
        weight       => 3.0,
    );
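
As a rough sketch of how a query might produce such spans itself, the example below uses stand-in classes. My::HighlightSpan, My::TermQuery, the boost attribute and the hashref layout of $doc_vector are all assumptions made for illustration; none of them is existing KinoSearch API.

```perl
use strict;
use warnings;

# Stand-in for the proposed HighlightSpan class (hypothetical).
package My::HighlightSpan;

sub new {
    my ( $class, %args ) = @_;
    return bless {
        start_offset => $args{start_offset},
        end_offset   => $args{end_offset},
        weight       => defined $args{weight} ? $args{weight} : 1.0,
    }, $class;
}
sub start_offset { return $_[0]{start_offset} }
sub end_offset   { return $_[0]{end_offset} }
sub weight       { return $_[0]{weight} }

# Stand-in for a single-term query that reports its own spans (hypothetical).
package My::TermQuery;

sub new {
    my ( $class, %args ) = @_;
    return bless { term => $args{term}, boost => $args{boost} || 1.0 }, $class;
}

# Assumption: $doc_vector is a plain hashref shaped like
#   { field_name => { term => [ [ start, end ], ... ] } }
sub highlight_data {
    my ( $self, $doc_vector, $field_name ) = @_;
    my $offsets = $doc_vector->{$field_name}{ $self->{term} } || [];
    my @spans;
    for my $pair (@$offsets) {
        push @spans, My::HighlightSpan->new(
            start_offset => $pair->[0],
            end_offset   => $pair->[1],
            weight       => $self->{boost},
        );
    }
    return \@spans;
}

package main;

my $doc_vector = { content => { heat => [ [ 0, 4 ], [ 20, 24 ] ] } };
my $query      = My::TermQuery->new( term => 'heat', boost => 3.0 );
my $spans      = $query->highlight_data( $doc_vector, 'content' );

printf "%d-%d (weight %.1f)\n", $_->start_offset, $_->end_offset, $_->weight
    for @$spans;
```

A phrase query would implement the same method with its own matching logic, so Highlighter never has to know how any particular query decides what matched.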

Highlighter can offer a public method, heat_map(), which takes an  
array of HighlightSpan objects as input, and returns a  
KinoSearch::Highlight::HeatMap object.  This object would serve as a  
vessel for the kind of information currently conveyed via  
_starts_and_ends and _calc_best_location.  In theory, a HeatMap
object might supply an array of floats, one per character in the
field; in practice, we'll need to dial that back.
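
The "in theory" version can be sketched as below: one float per character, each the sum of the weights of every span covering it. The function name naive_heat_map is made up, the spans are plain hashrefs to keep the example self-contained, and treating end_offset as exclusive is an assumption.

```perl
use strict;
use warnings;

# Naive heat map: one number per character in the field.
# Assumption: end_offset is exclusive.
sub naive_heat_map {
    my ( $field_length, $spans ) = @_;
    my @heat = (0) x $field_length;
    for my $span (@$spans) {
        my $last = $span->{end_offset} - 1;
        $last = $field_length - 1 if $last > $field_length - 1;    # clamp
        $heat[$_] += $span->{weight} for $span->{start_offset} .. $last;
    }
    return \@heat;
}

my $heat = naive_heat_map(
    12,
    [
        { start_offset => 0, end_offset => 4, weight => 3.0 },
        { start_offset => 2, end_offset => 8, weight => 1.0 },
    ]
);
print join( ' ', @$heat ), "\n";    # prints "3 3 4 4 1 1 1 1 0 0 0 0"
```

The cost grows with field length rather than with the number of spans, which is one reason a per-character array would have to be dialed back for large fields.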

The default Highlighter would use the HeatMap to find a single  
contiguous snippet.  Your subclass would use it to find multiple  
snippets.
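
One way to pick the single contiguous snippet from a per-character heat array is a sliding-window sum. This is only a sketch under that assumption: hottest_window is a hypothetical name, and it is not the logic of the existing _calc_best_location.

```perl
use strict;
use warnings;

# Return the start offset of the $width-character window with the most heat.
# Ties go to the earliest window.
sub hottest_window {
    my ( $heat, $width ) = @_;
    $width = @$heat if $width > @$heat;

    # Heat of the first window.
    my $sum = 0;
    $sum += $heat->[$_] for 0 .. $width - 1;
    my ( $best_start, $best_sum ) = ( 0, $sum );

    # Slide right: add the entering character, drop the leaving one.
    for my $start ( 1 .. @$heat - $width ) {
        $sum += $heat->[ $start + $width - 1 ] - $heat->[ $start - 1 ];
        ( $best_start, $best_sum ) = ( $start, $sum ) if $sum > $best_sum;
    }
    return $best_start;
}

my @heat = ( 0, 0, 1, 1, 0, 0, 3, 3, 3, 0, 0, 0 );
print hottest_window( \@heat, 4 ), "\n";    # prints "5"
```

A multi-snippet subclass could call the same routine repeatedly, zeroing out each chosen window before the next pass.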

As for the "rounding the ends" code... maybe a method called  
find_sentence_boundaries? generate_excerpts() can then make use of  
the boundary information however it sees fit.
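
A crude sketch of what find_sentence_boundaries might return, assuming it takes the field text and reports the character offsets at which sentences start. Both the signature and the "terminator followed by whitespace" rule are assumptions; abbreviations such as "e.g." would fool it.

```perl
use strict;
use warnings;

# Hypothetical: return an arrayref of offsets where sentences begin.
sub find_sentence_boundaries {
    my ($text) = @_;
    my @starts = (0);
    while ( $text =~ /[.!?]\s+/g ) {
        # pos() is the offset just past the terminator and whitespace.
        push @starts, pos($text) if pos($text) < length($text);
    }
    return \@starts;
}

my $text   = "First one. Second one! Third?";
my $starts = find_sentence_boundaries($text);
print join( ', ', @$starts ), "\n";    # prints "0, 11, 23"
```

generate_excerpts() could then snap a chosen window's start back to the nearest offset in that list.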

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/


