KinoSearch feature suggestions

Father Chrysostomos sprout at cpan.org
Tue Jan 22 17:04:35 PST 2008




On Jan 22, 2008, at 2:35 PM, Marvin Humphrey wrote:

> ...
>
> I am in favor of wildcards being available via a separate  
> distribution, and I would very much like to hammer out an elegant  
> low-level API to support such a distro.  A lot of the work I have  
> been doing lately is intended to facilitate such endeavors.

Is there anything I can do to help on the Perl side?
>
>
> Wildcards should not be in core KS, because they are by their nature  
> vastly more expensive than whole-word queries.  I have observed that  
> their comparative cost often comes as an unpleasant shock.  However,  
> providing a separate distro will prompt people to assess the costs  
> with open eyes.
>
>> 2. I’d like KinoSearch::Highlight::Highlighter to be able to create  
>> non-contiguous excerpts (which I’m calling ‘summaries’; the  
>> contiguous sub-parts of each summary I’m calling excerpts):
>>
>> $highlighter->add_spec( excerpt_length => 50, summary_length =>  
>> 200, ...);
>>
>> The highlighter would find the most important word to highlight (as  
>> it currently does), and create a 50-char excerpt. Then it would  
>> create an excerpt for the second most important word and add that  
>> (removing overlap if necessary), repeating this process until the  
>> summary is the right length.
>
> I think this should be implemented by abstracting out the excerpt  
> selection engine, analogous to the way that  
> KinoSearch::Highlight::Encoder and KinoSearch::Highlight::Formatter  
> abstract out other functionality used by the Highlighter.  How about  
> if we outsource excerpting to subclasses of a new class,  
> KinoSearch::Highlight::Excerpter?

I think I can have a patch for this in a couple of days.
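
To make the summary idea concrete, here is a minimal sketch of the selection loop described above: take hits in order of importance, cut an excerpt-sized window around each, merge overlapping windows, and stop once the summary is long enough. Nothing here is existing KinoSearch API. The function name `summary_spans` and the arguments `text_length` and `ranked_offsets` are hypothetical, and the sketch assumes the hits have already been ranked.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of KinoSearch.
# Returns sorted, non-overlapping [start, end) spans for a summary.
sub summary_spans {
    my (%args)   = @_;
    my $text_len = $args{text_length};
    my $exc_len  = $args{excerpt_length};
    my $sum_len  = $args{summary_length};

    my @spans;
    my $total = 0;
    for my $hit ( @{ $args{ranked_offsets} } ) {    # most important first
        last if $total >= $sum_len;

        # Centre an excerpt-sized window on the hit, clamped to the text.
        my $start = $hit - int( $exc_len / 2 );
        $start = 0 if $start < 0;
        my $end = $start + $exc_len;
        $end = $text_len if $end > $text_len;

        # Insert the new window and merge anything that overlaps.
        my @merged;
        for my $span ( sort { $a->[0] <=> $b->[0] } @spans, [ $start, $end ] ) {
            if ( @merged && $span->[0] <= $merged[-1][1] ) {
                $merged[-1][1] = $span->[1] if $span->[1] > $merged[-1][1];
            }
            else {
                push @merged, [@$span];
            }
        }
        @spans = @merged;

        # Recompute the summary length after merging.
        $total = 0;
        $total += $_->[1] - $_->[0] for @spans;
    }
    return @spans;
}

my @spans = summary_spans(
    text_length    => 1000,
    excerpt_length => 50,
    summary_length => 200,
    ranked_offsets => [ 500, 510, 100, 900, 300, 700 ],
);
print join( ', ', map {"$_->[0]-$_->[1]"} @spans ), "\n";
```

The sketch does not trim the last excerpt, so the summary can overshoot summary_length by less than one excerpt; a real Excerpter would also want to snap the window edges to word boundaries.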

> Then you could release your own distro, e.g.  
> KSx::Highlight::SummaryExcerpter.
>
----8<-------8<------
>
>
>> 4. Pagination (another highlighter feature): An index field could  
>> be designated as the ‘page offset’ field, containing byte offsets  
>> of page breaks.
>>
>> $highlighter->add_spec(
>> 	page_offset_field => 'pageoffsets',
>> 	page_offset_formatter => $object,
>> );
>>
>> And $object would have to have a page_label method: sub page_label  
>> { my ($self, $fields_hashref, $page_no) = @_; ... }
>
> This feature also seems like it should belong to a particular  
> Excerpter implementation.
>
>> Though it might be more complicated, maybe we could have page  
>> breaks (chr 12) recorded automatically when the index is created.  
>> Then ‘page_offset_field’ won’t be necessary.
>
> That would work well.  It's trivial to implement effectively using  
> C/XS, because you can just zip along the string counting page breaks.
>
>    [... etc ...]

But the *offsets* of the page breaks need to be recorded; counting them is  
not sufficient. I still have to think more about how this should work,  
unless you have some ideas.
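
As a starting point, here is a pure-Perl sketch of what recording the offsets might look like, as opposed to just counting. The function names `page_break_offsets` and `page_for_offset` are hypothetical and not part of KinoSearch. It also assumes the offsets are measured in the same units as the string passed in, so they are byte offsets only if the text is a byte string.

```perl
use strict;
use warnings;

# Hypothetical helper: return the offset of every form feed (chr 12) in $text.
sub page_break_offsets {
    my ($text) = @_;
    my @offsets;
    my $pos = 0;
    while ( ( my $found = index( $text, "\f", $pos ) ) >= 0 ) {
        push @offsets, $found;
        $pos = $found + 1;
    }
    return @offsets;
}

# Hypothetical helper: map an offset (e.g. the start of an excerpt)
# to a 1-based page number, given the recorded break offsets in order.
sub page_for_offset {
    my ( $offset, @breaks ) = @_;
    my $page = 1;
    for my $break (@breaks) {
        last if $break >= $offset;
        $page++;
    }
    return $page;
}

my $text   = "one\ftwo\fthree";
my @breaks = page_break_offsets($text);
print "breaks at: @breaks\n";
print "offset 5 is on page ", page_for_offset( 5, @breaks ), "\n";
```

With the offsets stored alongside the document, the highlighter could look up the page number for an excerpt's start offset and hand it to something like the page_label method.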


Father Chrysostomos




