[KinoSearch] Re: Subclassable Highlighter (was: Re: KinoSearch feature suggestions)

Marvin Humphrey marvin at rectangular.com
Sun Jan 27 19:56:43 PST 2008




On Jan 27, 2008, at 6:26 PM, Father Chrysostomos wrote:

> Actually what I’ve done so far has ->single_excerpt($hit, \%spec),  
> but I think what you have is better.

Glad it works for you.  With only a couple exceptions, argument lists  
in KS take one of the following two forms:

    $foo->($single_arg);
    $foo->(%labeled_params);
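
As a minimal sketch of those two forms: the `My::Widget` class and its method names below are hypothetical, made up purely for illustration, and are not part of KinoSearch.

```perl
use strict;
use warnings;

# Hypothetical class, for illustration only; not part of KinoSearch.
package My::Widget;

sub new { return bless { size => 0, color => 'none' }, shift }

# Form 1: a single positional argument.
sub set_size {
    my ( $self, $size ) = @_;
    $self->{size} = $size;
    return $self;
}

# Form 2: labeled parameters, passed as a flat key/value list.
sub configure {
    my ( $self, %args ) = @_;
    $self->{size}  = $args{size}  if exists $args{size};
    $self->{color} = $args{color} if exists $args{color};
    return $self;
}

package main;

my $widget = My::Widget->new;
$widget->set_size(3);
$widget->configure( size => 5, color => 'red' );
```

Sticking to these two shapes means a caller never has to remember the order of more than one argument.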

> I’ve also made doc_vector an attribute of Hit (see the attached  
> file [if I remember to attach it after typing this message]). Now  
> I’m not certain that the hit needs a reference to the doc number,  
> but it’s in there.

People who don't want highlighting shouldn't have to pay the penalty  
of retrieving the DocVector by default.
In any case, having HitDoc carry all that information seems kind of  
messy.

The doc_num is the key that enables you to get at all that info --  
and it's just a nice, featherweight integer.
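
A rough sketch of the idea, with the hit carrying only the doc number and the expensive lookup happening on demand. `My::Searcher`, `My::Hit`, and `fetch_doc_vec` here are hypothetical stand-ins, not the actual KinoSearch classes or methods.

```perl
use strict;
use warnings;

# Hypothetical stand-ins for illustration; these are not KinoSearch classes.
package My::Searcher;

sub new { return bless { fetches => 0 }, shift }

# Pretend this is the expensive lookup of per-document highlight data.
sub fetch_doc_vec {
    my ( $self, $doc_num ) = @_;
    $self->{fetches}++;
    return { doc_num => $doc_num };
}

package My::Hit;

sub new {
    my ( $class, %args ) = @_;
    return bless { doc_num => $args{doc_num} }, $class;
}
sub get_doc_num { return $_[0]->{doc_num} }

package main;

my $searcher = My::Searcher->new;
my @hits     = map { My::Hit->new( doc_num => $_ ) } ( 3, 7, 42 );

# Only the hit that actually gets highlighted pays for the lookup.
my $doc_vec = $searcher->fetch_doc_vec( $hits[0]->get_doc_num );
```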

> Also, it has a reference to the query, so that single_excerpt can call
>
> 	$hit_doc->highlight_data( $excerpt_field )
>
> and just has to pass one arg.

Oh, yeah -- forgot about that.

How about supplying the query as an argument to Highlighter's  
constructor?  That's how Search::Tools::HiLiter works.

   my $hiliter = Search::Tools::HiLiter->new(
      query => $query,
   );

   my $highlighter = KinoSearch::Highlight::Highlighter->new(
       searcher => $searcher,
       query    => $query,
   );

If the 'query' param is just a query string, we can ask $searcher to  
perform its default parsing (currently in  
KinoSearch::Search::Searchable::_prepare_simple_search).  If it's an  
object, we assume that it's the same Query object that was supplied  
to $searcher->search.
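
Something along these lines, as a sketch only. The `My::*` classes and the `parse_query_string` method are hypothetical placeholders standing in for the Searcher's default parsing, not the real KinoSearch API.

```perl
use strict;
use warnings;

# Hypothetical stand-ins; not the real KinoSearch classes.
package My::Query;

sub new { my ( $class, %args ) = @_; return bless {%args}, $class }

package My::Searcher;

sub new { return bless {}, shift }

# Stand-in for the searcher's default query-string parsing.
sub parse_query_string {
    my ( $self, $query_string ) = @_;
    return My::Query->new( text => $query_string );
}

package My::Highlighter;
use Scalar::Util qw( blessed );
use Carp qw( croak );

sub new {
    my ( $class, %args ) = @_;
    my $searcher = $args{searcher}
        or croak("Missing required param 'searcher'");
    my $query = $args{query};
    croak("Missing required param 'query'") unless defined $query;

    # A plain string gets parsed; an object is used as-is.
    $query = $searcher->parse_query_string($query) unless blessed($query);

    return bless { searcher => $searcher, query => $query }, $class;
}
sub get_query { return $_[0]->{query} }

package main;

my $searcher    = My::Searcher->new;
my $from_string = My::Highlighter->new(
    searcher => $searcher,
    query    => 'foo bar',
);
my $query_obj   = My::Query->new( text => 'foo bar' );
my $from_object = My::Highlighter->new(
    searcher => $searcher,
    query    => $query_obj,
);
```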

> This highlight_data method calls the method of the same name on the  
> query and then sorts its return value and removes duplicates.
>
>> The DocVector object would be retrieved within single_excerpt() --  
>> which becomes possible once the Highlighter gets a Searcher at  
>> construction time.
>
> DocVector is currently documented as a private class. Do we want a  
> ‘publicly subclassable’ method to have to deal with it?

Ultimately, we want the information that is currently in DocVector to  
be available via a public API.

The word "vector" is only there for Lucene legacy reasons.   
TermVector data in Lucene is used for other things besides  
highlighting -- but IMO it's only highlighting that's crucial.   
Highlighting is really important.

I think that as we write the Query methods for obtaining  
HighlightSpans, we might have ideas about how to improve DocVector/ 
TermVector.  By the time that's done, hopefully we'll have an  
acceptable public API.

Following up on this, I think FieldSpec::vectorized should be  
replaced by FieldSpec::highlightable.  highlightable() (or maybe just  
"highlight"?) should default to 0, unlike the way  
FieldSpec::vectorized() currently defaults to 1.  However, the schema  
used by KinoSearch::Simple will enable highlighting by default, so  
that novice users will still have it easy.
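
Roughly, the defaults would look like this. The `My::FieldSpec` and `My::Simple::FieldSpec` classes are hypothetical and only illustrate the proposed behavior; nothing here is existing KinoSearch code.

```perl
use strict;
use warnings;

# Hypothetical classes illustrating the proposed default; not KinoSearch code.
package My::FieldSpec;

sub new { return bless {}, shift }
sub highlightable { return 0 }    # proposed default: off

# The field type a "Simple"-style schema would use: highlighting on.
package My::Simple::FieldSpec;
our @ISA = ('My::FieldSpec');

sub highlightable { return 1 }

package main;

my $plain  = My::FieldSpec->new;
my $simple = My::Simple::FieldSpec->new;
```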

>  But I don’t know what the Doc is currently for....

It will be clear once Doc is integrated into the indexing stage.

I have a cargo-cult solution that makes overloading work correctly  
with KinoSearch::Obj subclasses under 5.8.8, but I'm trying to  
understand why it works by spelunking the relevant parts of the Perl  
source, and that's slowing me down.

> Also, do we need a HighlightSpan object? Won’t a simple hash do?  
> Likewise with a heat map.

There are a number of reasons to use objects instead of hashes.

First, auto-vivification is evil.  Mistyped hash keys should not  
result in subtly incorrect behavior, e.g. because a default was used  
instead of the supplied arg.
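
To illustrate the difference, here is a small sketch. `My::Span` is a hypothetical class invented for the example, not the eventual HighlightSpan.

```perl
use strict;
use warnings;

# Hypothetical span class for illustration; not the KinoSearch HighlightSpan.
package My::Span;

sub new {
    my ( $class, %args ) = @_;
    return bless { start => $args{start}, end => $args{end} }, $class;
}
sub get_start { return $_[0]->{start} }
sub get_end   { return $_[0]->{end} }

package main;

# With a plain hash, a mistyped key silently falls back to the default.
my %span_hash  = ( start => 10, end => 25 );
my $hash_start = exists $span_hash{strat} ? $span_hash{strat} : 0;

# With an object, the same typo in a method name dies immediately.
my $span  = My::Span->new( start => 10, end => 25 );
my $ok    = eval { $span->get_strat; 1 };
my $error = $@;
```

The hash version carries on with a start offset of 0 and no complaint, while the method call fails loudly at the point of the mistake.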

Second, classes which pass around hashes are tightly coupled, and  
thus fragile and convoluted.  Spelunking the code for the old  
Test::Harness (which I had to do when hacking on David Wheeler's  
JavaScript port of it) was really unpleasant for this reason.  Using  
an intermediate class (with accessor methods) to convey data between  
two classes forces much more robust and well-defined interaction.

Third, a lot of this stuff is going to get ported to C eventually,  
where using a dedicated class is actually easier than passing around  
a Hash object because hash manipulation syntax isn't built into the  
language.

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/


