[KinoSearch] Re: KinoSearch feature suggestions

Marvin Humphrey marvin at rectangular.com
Tue Jan 22 14:35:16 PST 2008



Hi,

On Jan 21, 2008, at 2:16 PM, Father Chrysostomos wrote:

> I’d like to request that a few features be added to KinoSearch. I  
> need these features myself, so I’m willing to contribute patches.  
> Please let me know what you think.

I'm going to take the liberty of cc'ing this to the KinoSearch  
mailing list, since it was filed as a public rt.cpan.org issue.

> 1. Wildcards in search queries

I am in favor of wildcards being available via a separate
distribution, and I would very much like to hammer out an elegant
low-level API to support such a distro.  A lot of the work I have been
doing lately is intended to facilitate such endeavors.

Wildcards should not be in core KS, because they are by their nature  
vastly more expensive than whole-word queries.  I have observed that  
their comparative cost often comes as an unpleasant shock.  However,  
providing a separate distro will prompt people to assess the costs  
with open eyes.

> 2. I’d like KinoSearch::Highlight::Highlighter to be able to create  
> non-contiguous excerpts (which I’m calling ‘summaries’; the  
> contiguous sub-parts of each summary I’m calling excerpts):
>
> $highlighter->add_spec(
>     excerpt_length => 50,
>     summary_length => 200,
>     ...
> );
>
> The highlighter would find the most important word to highlight (as  
> it currently does), and create a 50-char excerpt. Then it would  
> create an excerpt for the second most important word and add that  
> (removing overlap if necessary), repeating this process until the  
> summary is the right length.

I think this should be implemented by abstracting out the excerpt  
selection engine, analogous to the way that  
KinoSearch::Highlight::Encoder and KinoSearch::Highlight::Formatter  
abstract out other functionality used by the Highlighter.  How about  
if we outsource excerpting to subclasses of a new class,  
KinoSearch::Highlight::Excerpter?  Then you could release your own  
distro, e.g. KSx::Highlight::SummaryExcerpter.
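
To make the selection step concrete, here is a rough sketch in plain C of the greedy loop described in the request: take hits in order of importance, cut a window of excerpt_length around each, merge windows that overlap, and stop once the combined length reaches summary_length. None of this is KinoSearch code. Span, add_span and build_summary are made-up names, and the sketch assumes the hit positions have already been ranked by importance and all fall inside the text.

```c
#include <assert.h>
#include <stddef.h>

/* Half-open byte range [start, end) within the source text. */
typedef struct { size_t start; size_t end; } Span;

/* Insert s into a sorted, non-overlapping array of n spans, merging it
 * with any spans it overlaps or touches.  Returns the new span count.
 * The array must have room for n + 1 entries. */
static size_t
add_span(Span *spans, size_t n, Span s) {
    size_t i = 0, j, k;

    /* Skip spans that end before s begins. */
    while (i < n && spans[i].end < s.start) i++;

    /* Absorb every span that overlaps or touches s. */
    j = i;
    while (j < n && spans[j].start <= s.end) {
        if (spans[j].start < s.start) s.start = spans[j].start;
        if (spans[j].end   > s.end)   s.end   = spans[j].end;
        j++;
    }

    if (j == i) {
        /* Nothing absorbed: shift the tail right and insert. */
        for (k = n; k > i; k--) spans[k] = spans[k - 1];
        spans[i] = s;
        return n + 1;
    }

    /* spans[i..j) collapse into one merged span; close the gap. */
    spans[i] = s;
    for (k = j; k < n; k++) spans[k - (j - i - 1)] = spans[k];
    return n - (j - i - 1);
}

/* Build a summary from hits (byte positions, most important first).
 * Writes the chosen excerpts, sorted by position, to spans, which must
 * have room for num_hits entries.  Returns the number of excerpts. */
size_t
build_summary(const size_t *hits, size_t num_hits, size_t text_len,
              size_t excerpt_len, size_t summary_len, Span *spans) {
    size_t n = 0, total = 0, h, i;
    size_t half = excerpt_len / 2;

    assert(excerpt_len > 0);
    for (h = 0; h < num_hits && total < summary_len; h++) {
        Span s;
        assert(hits[h] < text_len);

        /* Center a window on the hit, clipped to the text bounds. */
        s.start = hits[h] > half ? hits[h] - half : 0;
        s.end   = s.start + excerpt_len;
        if (s.end > text_len) s.end = text_len;

        n = add_span(spans, n, s);

        /* Recompute the summary length after merging. */
        total = 0;
        for (i = 0; i < n; i++) total += spans[i].end - spans[i].start;
    }
    return n;
}
```

The loop stops as soon as the total reaches summary_length, so the result can overshoot by up to one excerpt. It also cuts on raw byte offsets, ignoring word and character boundaries, which a real Excerpter would have to respect.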

> 3. Custom ellipsis marks:
>
> $highlighter->add_spec( ellipsis_mark => "\x{2026}", ... )

I understand the problem, but adding such a specific param to
Highlighter->add_spec seems brittle.  I think this should be
something which is set via a custom excerpting engine.

Incidentally, Highlighter's treatment of the ellipsis also prompted  
part of <http://rt.cpan.org/Public/Bug/Display.html?id=25400>.

> 4. Pagination (another highlighter feature): An index field could  
> be designated as the ‘page offset’ field, containing byte offsets  
> of page breaks.
>
> $highlighter->add_spec(
> 	page_offset_field => 'pageoffsets',
> 	page_offset_formatter => $object,
> );
>
> And $object would have to have a page_label method: sub page_label  
> { my ($self, $fields_hashref, $page_no) = @_; ... }

This feature also seems like it should belong to a particular  
Excerpter implementation.

> Though it might be more complicated, maybe we could have page  
> breaks (chr 12) recorded automatically when the index is created.  
> Then ‘page_offset_field’ won’t be necessary.

That would work well.  It's trivial to implement efficiently in C/XS,
because you can just zip along the string counting page breaks.

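     /* Count form feeds (chr 12) in the string held by input_sv. */
     long
     count_breaks(SV *input_sv) {
         STRLEN len;
         char *ptr = SvPV(input_sv, len);  /* string buffer and byte length */
         char *end = ptr + len;            /* one past the last byte */
         long count = 0;
         while (ptr < end) { if (*ptr++ == 12) count++; }
         return count;
     }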

In pure Perl, tr/// works for efficient character counting, IIRC:
($text =~ tr/\f//) returns the number of form feeds without modifying
the string.
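
If the indexer is to stand in for the 'page offset' field, it needs the byte offset of each break rather than just a count. A minimal sketch of that in plain C (no Perl API, so it can be tried standalone) follows. The function names find_page_breaks and page_for_offset are invented for illustration, and the convention that a form feed belongs to the page it ends is an assumption.

```c
#include <assert.h>
#include <stddef.h>

/* Record the byte offset of each form feed (chr 12) in buf.
 * Stores at most max_offsets entries in offsets and returns the total
 * number of page breaks found, which may exceed max_offsets. */
size_t
find_page_breaks(const char *buf, size_t len,
                 size_t *offsets, size_t max_offsets) {
    size_t count = 0, i;

    assert(buf != NULL || len == 0);
    for (i = 0; i < len; i++) {
        if (buf[i] == '\f') {
            if (count < max_offsets) offsets[count] = i;
            count++;
        }
    }
    return count;
}

/* Map a byte offset to a zero-based page number: the number of page
 * breaks that occur strictly before it.  offsets must be in ascending
 * order, as produced by find_page_breaks. */
size_t
page_for_offset(const size_t *offsets, size_t num_breaks,
                size_t byte_offset) {
    size_t lo = 0, hi = num_breaks;

    /* Binary search for the first break at or after byte_offset. */
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (offsets[mid] < byte_offset) lo = mid + 1;
        else                            hi = mid;
    }
    return lo;
}
```

With the offsets stored at index time, a highlighter could call something like page_for_offset on the start of each excerpt and hand the resulting page number to the page_label method proposed above.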

> For examples of 2 and 4 in use, see
> <http://synodinresistance.org/cgi-bin/anazetesis?all=1&and-glossa=&and-morphe=&g=en&q=thing>
> (which I’d like to switch to using KinoSearch, because it’s
> currently too slow).

I admire the sophistication of the excerpting provided.  Kudos.

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/


