KinoSearch feature suggestions
Father Chrysostomos
sprout at cpan.org
Tue Jan 22 17:04:35 PST 2008
On Jan 22, 2008, at 2:35 PM, Marvin Humphrey wrote:
> ...
>
> I am in favor of wildcards being available via a separate
> distribution, and I would very much like to hammer out an elegant
> low-level API to support such a distro. A lot of the work I have
> been doing lately is intended to facilitate such endeavors.
Is there anything I can do to help on the Perl side?
>
>
> Wildcards should not be in core KS, because they are by their nature
> vastly more expensive than whole-word queries. I have observed that
> their comparative cost often comes as an unpleasant shock. However,
> providing a separate distro will prompt people to assess the costs
> with open eyes.
>
>> 2. I’d like KinoSearch::Highlight::Highlighter to be able to create
>> non-contiguous excerpts (which I’m calling ‘summaries’; the
>> contiguous sub-parts of each summary I’m calling excerpts):
>>
>> $highlighter->add_spec(
>>     excerpt_length => 50,
>>     summary_length => 200,
>>     ...
>> );
>>
>> The highlighter would find the most important word to highlight (as
>> it currently does), and create a 50-char excerpt. Then it would
>> create an excerpt for the second most important word and add that
>> (removing overlap if necessary), repeating this process until the
>> summary is the right length.
>
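The selection loop described above could be sketched roughly as follows. This is only an illustration in plain Perl, not KinoSearch code: the function name select_excerpts, the [ offset, score ] hit pairs, and the choice to centre each window on its hit are all assumptions made for the example.

```perl
use strict;
use warnings;

# Hypothetical sketch, not part of KinoSearch. @$hits holds [ offset, score ]
# pairs for matched terms. Returns sorted, non-overlapping [ start, end )
# windows. Selection stops once their combined length reaches
# $summary_length, so the total may overshoot by less than one excerpt.
sub select_excerpts {
    my ( $hits, $text_length, $excerpt_length, $summary_length ) = @_;
    my @windows;
    my $total = 0;
    for my $hit ( sort { $b->[1] <=> $a->[1] } @$hits ) {
        last if $total >= $summary_length;

        # Centre a window on the hit, clamped to the text boundaries.
        my $start = $hit->[0] - int( $excerpt_length / 2 );
        $start = 0 if $start < 0;
        my $end = $start + $excerpt_length;
        $end = $text_length if $end > $text_length;

        # Absorb any existing window that overlaps or touches the new one.
        my @keep;
        for my $w (@windows) {
            if ( $w->[1] < $start or $w->[0] > $end ) {
                push @keep, $w;
            }
            else {
                $start = $w->[0] if $w->[0] < $start;
                $end   = $w->[1] if $w->[1] > $end;
            }
        }
        @windows = sort { $a->[0] <=> $b->[0] } ( @keep, [ $start, $end ] );

        # Recompute the combined length after merging.
        $total = 0;
        $total += $_->[1] - $_->[0] for @windows;
    }
    return \@windows;
}

my $windows = select_excerpts( [ [ 100, 3 ], [ 110, 2 ], [ 500, 1 ] ], 1000, 50, 200 );
print join( ' ', map {"$_->[0]-$_->[1]"} @$windows ), "\n";
```

Merging overlapping windows rather than trimming them keeps each excerpt contiguous, which is one possible reading of "removing overlap if necessary".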
> I think this should be implemented by abstracting out the excerpt
> selection engine, analogous to the way that
> KinoSearch::Highlight::Encoder and KinoSearch::Highlight::Formatter
> abstract out other functionality used by the Highlighter. How about
> if we outsource excerpting to subclasses of a new class,
> KinoSearch::Highlight::Excerpter?
I think I can have a patch for this in a couple of days.
> Then you could release your own distro, e.g.
> KSx::Highlight::SummaryExcerpter.
>
----8<-------8<------
>
>
>> 4. Pagination (another highlighter feature): An index field could
>> be designated as the ‘page offset’ field, containing byte offsets
>> of page breaks.
>>
>> $highlighter->add_spec(
>>     page_offset_field     => 'pageoffsets',
>>     page_offset_formatter => $object,
>> );
>>
>> And $object would have to have a page_label method:
>>
>> sub page_label {
>>     my ($self, $fields_hashref, $page_no) = @_;
>>     ...
>> }
>
> This feature also seems like it should belong to a particular
> Excerpter implementation.
>
>> Though it might be more complicated, maybe we could have page
>> breaks (chr 12) recorded automatically when the index is created.
>> Then ‘page_offset_field’ won’t be necessary.
>
> That would work well. It's trivial to implement effectively using
> C/XS, because you can just zip along the string counting page breaks.
>
> [... etc ...]
But the *offsets* of the page breaks need to be recorded. Counting is
not sufficient. I still have to think more about how this should work—
unless you have some ideas.
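
To make the difference between counting and recording concrete, here is a minimal sketch in plain Perl (not the C/XS version, and not KinoSearch API). Both function names are made up for illustration, and offsets are in characters unless the string has already been encoded to bytes.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of KinoSearch: return the offset of every
# form feed (chr 12) in $text. Offsets are in characters for a character
# string, or in bytes if $text has already been encoded.
sub page_break_offsets {
    my ($text) = @_;
    my @offsets;
    my $pos = 0;
    while ( ( my $found = index( $text, "\f", $pos ) ) >= 0 ) {
        push @offsets, $found;
        $pos = $found + 1;
    }
    return \@offsets;
}

# Hypothetical helper: 1-based page number containing $offset, given the
# sorted break offsets. An offset that falls exactly on a break character
# is counted with the page that the break ends.
sub page_for_offset {
    my ( $breaks, $offset ) = @_;
    my $page = 1;
    for my $break (@$breaks) {
        last if $offset <= $break;
        $page++;
    }
    return $page;
}

my $doc    = "first page\fsecond page\fthird";
my $breaks = page_break_offsets($doc);
print "@$breaks\n";                            # prints "10 22"
print page_for_offset( $breaks, 15 ), "\n";    # prints "2"
```

With the offsets stored, mapping an excerpt's start position to a page number is a simple lookup; a bare count of page breaks would not allow that.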
Father Chrysostomos