[KinoSearch] Re: Subclassable Highlighter (was: Re: KinoSearch feature suggestions)
Marvin Humphrey
marvin at rectangular.com
Wed Jan 23 22:33:20 PST 2008
On Jan 23, 2008, at 12:48 PM, Father Chrysostomos wrote:
> Since the highlighter’s main job is to create the excerpt, I think
> it would actually be better if we made it easy to subclass by
> dividing up its _gen_excerpt method.
OK, I can see the rationale for this. The Excerpter would probably
need access to the Formatter and the Encoder anyway. At that point,
it's not really an excerpter -- it's a highlighter.
However, I think that the internal methods of Highlighter ought to
remain internal. They were not designed to be public. The division
of labor amongst them isn't particularly elegant or clean. They call
other, non-public methods within the KinoSearch suite.
I think we need the public interface for Highlighter to be more
general. And perhaps it usage shouldn't be integrated into Hits as
it is now. I think this, which is along the lines of Peter's
Search::Tools::HiLiter, would be the better:
my $highlighter = KinoSearch::Highlight::Highlighter->new(
searcher => $searcher,
);
my $hits = $searcher->search( query => $query_string );
while ( my $hit = $hits->fetch_hit ) {
my $excerpts = $highlighter->generate_excerpts($hit);
...
}
In fact, using Search::Tools::HiLiter would look almost the same:
my $highliter = Search::Tools::HiLiter->new( query =>
$query_string );
my $hits = $searcher->search( query => $query_string );
while ( my $hit = $hits->fetch_hit_hashref ) {
my $excerpt = $highliter->light( $hit->{content} );
...
}
The only public methods on Highlighter would be generate_excerpts
($hit), get_formatter(), and get_encoder(). If we add others -- and
I can see how that would benefit you -- they should be more coherent
than the current internal methods.
Having the Highlighter operate standalone a la Search::Tools::HiLiter
just makes more sense. Unfortunately, the problem with that design
is that $hits->fetch_hit_hashref returns a plain old hashref, and
that's not enough. Crucially, the hashref does not convey the
document number, which is needed to retrieve the DocVector associated
with the hit from the $searcher.
I've pondered how to associate the return value of $hits-
>fetch_hit_hashref with a document number for a while, without
arriving at a satisfactory solution.
1) Stuffing it into the hashref like $hashref->{_kino_doc_num} is
cheesy. It's unexpected, messes up the parallels with DBI's
fetch_row_hashref, and would get in the way if someone wanted to
iterate over their fields using Perl's hash-manipulation functions.
2) It might make sense to implement KinoSearch::Document as a blessed
hashref. However, it would be the only class in all of KS which
didn't subclass KinoSearch::Obj, and thus which couldn't be used at
the C level.
3) Implementing KinoSearch::Document as a standard KS object and
giving it a $document->to_hashref method would work...
while ( my $hit = $hits->fetch_hit ) {
my $hashref = $hit->to_hashref;
my $excerpts = $highlighter->generate_excerpts($hit);
...
}
... but it's a little less elegant than returning a plain old hashref
and there's extra overhead from unnecessary string copying.
Marvin Humphrey
Rectangular Research
http://www.rectangular.com/
_______________________________________________
KinoSearch mailing list
KinoSearch at rectangular.com
http://www.rectangular.com/mailman/listinfo/kinosearch
More information about the kinosearch
mailing list