Subclassable Highlighter (was: Re: KinoSearch feature suggestions)

Father Chrysostomos sprout at cpan.org
Wed Jan 23 12:48:42 PST 2008




On Jan 23, 2008, at 6:32 AM, Marvin Humphrey wrote:

>>> How about if we outsource excerpting to subclasses of a new class,  
>>> KinoSearch::Highlight::Excerpter?
>>
>> I think I can have a patch for this in a couple of days.
>
> Sweet.  :)

Since the highlighter’s main job is to create the excerpt, I think it  
would actually be better if we made it easy to subclass by dividing up  
its _gen_excerpt method.

So, we’d have:

• gen_excerpt

This will call starts_and_ends and calc_best_location, then pass  
beginning and ending offsets for the excerpt to  
gen_excerpt_from_offsets. A subclass can override this to call the  
latter multiple times.

• starts_and_ends

Just _starts_and_ends renamed, so that subclasses can call it while  
still using the public API.

• calc_best_location

_calc_best_location renamed, and made to return a list in list context.

• gen_excerpt_from_offsets

This will ‘round off’ the offsets passed to it to the nearest sentence  
boundary, if possible, and then call format_excerpt (passing it a  
couple of flags to indicate whether ellipsis marks are needed).

• format_excerpt

This will take care of all the formatting, calling the formatter and  
encoder as needed, and adding ellipsis marks.
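
To make the division of labour concrete, here is a rough, self-contained sketch of the call chain. It is not KinoSearch code: the package names, the term, excerpt_length and slop parameters, and the naive window and sentence-boundary logic are all stand-ins invented for illustration. Only the five method names come from the list above.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the highlighter; NOT the KinoSearch API.
package Sketch::Highlighter;

my %ENT = ( '&' => '&amp;', '<' => '&lt;', '>' => '&gt;', '"' => '&quot;' );

sub new {
    my ( $class, %args ) = @_;
    return bless { excerpt_length => 40, slop => 20, %args }, $class;
}

# Return a list of [start, end] offsets, one per occurrence of the term.
sub starts_and_ends {
    my ( $self, $text ) = @_;
    my @spans;
    while ( $text =~ /\Q$self->{term}\E/gi ) {
        push @spans, [ $-[0], $+[0] ];
    }
    return @spans;
}

# Pick a window around the first span given; returns (start, end) as a list.
sub calc_best_location {
    my ( $self, $text, @spans ) = @_;
    my $first = @spans ? $spans[0][0] : 0;
    my $start = $first - int( $self->{excerpt_length} / 2 );
    $start = 0 if $start < 0;
    my $end = $start + $self->{excerpt_length};
    $end = length($text) if $end > length($text);
    return ( $start, $end );
}

# Move the start back to a sentence boundary if one lies within 'slop'
# characters, then hand over to format_excerpt with the ellipsis flags.
sub gen_excerpt_from_offsets {
    my ( $self, $text, $start, $end ) = @_;
    my $at_boundary = ( $start == 0 );
    if ( !$at_boundary && substr( $text, 0, $start ) =~ /^.*[.!?]\s+/s ) {
        my $candidate = $+[0];    # offset just past the last sentence end
        if ( $start - $candidate <= $self->{slop} ) {
            $start       = $candidate;
            $at_boundary = 1;
        }
    }
    my $excerpt = substr( $text, $start, $end - $start );
    return $self->format_excerpt( $excerpt, !$at_boundary,
        $end < length($text) );
}

# Encode, mark up the hits, and add ellipsis marks where flagged.
sub format_excerpt {
    my ( $self, $excerpt, $lead, $trail ) = @_;
    $excerpt =~ s/([&<>"])/$ENT{$1}/g;                            # "encoder"
    $excerpt =~ s/(\Q$self->{term}\E)/<strong>$1<\/strong>/gi;    # "formatter"
    return ( $lead ? '...' : '' ) . $excerpt . ( $trail ? '...' : '' );
}

sub gen_excerpt {
    my ( $self, $text ) = @_;
    my @spans = $self->starts_and_ends($text);
    my ( $start, $end ) = $self->calc_best_location( $text, @spans );
    return $self->gen_excerpt_from_offsets( $text, $start, $end );
}

# A subclass that overrides gen_excerpt to produce one excerpt per hit.
package Sketch::MultiHighlighter;
our @ISA = ('Sketch::Highlighter');

sub gen_excerpt {
    my ( $self, $text ) = @_;
    my @parts;
    for my $span ( $self->starts_and_ends($text) ) {
        my ( $start, $end ) = $self->calc_best_location( $text, $span );
        push @parts, $self->gen_excerpt_from_offsets( $text, $start, $end );
    }
    return join "\n", @parts;
}

package main;

my $text = 'First sentence here. The highlighter builds an excerpt. '
    . 'Later, another excerpt appears in the text.';
my $hl = Sketch::MultiHighlighter->new(
    term => 'excerpt', excerpt_length => 30, slop => 15 );
print $hl->gen_excerpt($text), "\n";
```

The point of the split is that the subclass only replaces gen_excerpt; the offset rounding and the formatting are inherited unchanged.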


Please let me know if this is too complex and there is a better way I  
haven’t thought of....

>
>
>> But the *offsets* of the page breaks need to be recorded. Counting  
>> is not sufficient. I still have to think more about how this should  
>> work—unless you have some ideas.
>
> We can modify that function to record offsets in a Perl array.  This  
> (untested) variant renders those offsets as counts of Unicode code  
> points:
>
> [...]

I don’t know why I didn’t see this sooner, but the  
indexer/tokenizer/whatever doesn’t need to care about form feeds. A  
highlighter subclass can use your counting method (or y///) to see how  
many form feeds occur before the excerpt, so that problem has solved  
itself, as it were.
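
For instance, something along these lines would do. It is a minimal sketch: page_for_offset is a made-up helper name, and it assumes the stored text separates pages with form feeds and that the excerpt's starting offset is counted in characters.

```perl
use strict;
use warnings;

# Hypothetical helper: 1-based page number of the character at $offset,
# assuming pages in $text are separated by form feeds ("\f").
sub page_for_offset {
    my ( $text, $offset ) = @_;
    my $prefix = substr( $text, 0, $offset );

    # y/// with an empty replacement list and no /d modifier only counts
    # the matching characters; it does not modify the string.
    my $form_feeds = ( $prefix =~ y/\f// );
    return $form_feeds + 1;
}

my $doc   = "page one text\fpage two text\fpage three";
my $start = index( $doc, 'three' );
print page_for_offset( $doc, $start ), "\n";    # prints 3
```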


