[KinoSearch] Re: Subclassable Highlighter (was: Re: KinoSearch feature suggestions)

Marvin Humphrey marvin at rectangular.com
Sun Jan 27 17:14:33 PST 2008

On Jan 26, 2008, at 2:09 PM, Father Chrysostomos wrote:
> Do you mean to eliminate add_spec?

No, that was just an oversight while writing untested code for email.

> 2) Forget about get_(formatter|encoder), since each spec might have  
> a different one.

Yes.   I'd unconsciously reverted to the old API, where there was  
only one formatter/encoder per highlighter.   (: It's because I make  
mistakes like these that KS has arg checking everywhere.  :)

> 3) Make generate_excerpts call generate_excerpt (_gen_excerpt  
> renamed); or maybe we should call it single_excerpt, to  
> differentiate between it and the former more easily. single_excerpt  
> will be called with its current args, and can be overridden in a  
> subclass. The $spec passed to single_excerpt can be documented to  
> contain the args passed to add_spec, with defaults filled in. So  
> $spec->{limit} should be removed and calculated in the default  
> single_excerpt method instead of in add_spec.

Sounds well thought through.  I concur with making single_excerpt()  
public with that API.
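
To make the override point concrete, here is a minimal self-contained sketch of that arrangement. Everything in the Sketch:: namespace is a toy stand-in made up for illustration, not the real KinoSearch classes; the excerpt_length default of 200 and the excerpt_length / 3 formula for the limit are assumptions, and "doc" is a plain hashref here.

```perl
use strict;
use warnings;

# Toy stand-in for the proposed highlighter (hypothetical, not KinoSearch).
package Sketch::Highlighter;

sub new {
    my ( $class, %args ) = @_;
    return bless { searcher => $args{searcher}, specs => [] }, $class;
}

# Store the caller's args with defaults filled in; no derived "limit" here.
sub add_spec {
    my ( $self, %args ) = @_;
    my %spec = ( excerpt_length => 200, %args );    # default is an assumption
    push @{ $self->{specs} }, \%spec;
    return;
}

# Public and overridable: receives the spec's args plus "doc".
sub single_excerpt {
    my ( $self, %args ) = @_;

    # Derived on each call instead of cached in the spec (assumed formula).
    my $limit = int( $args{excerpt_length} / 3 );

    my $text = $args{doc}{ $args{name} };
    $text = '' unless defined $text;
    my $excerpt = substr( $text, 0, $args{excerpt_length} );

    # If the text was truncated, back up to the last space, but only
    # when that space falls within the final $limit characters.
    if ( length($text) > length($excerpt) ) {
        my $cut = rindex( $excerpt, ' ' );
        $excerpt = substr( $excerpt, 0, $cut )
            if $cut >= length($excerpt) - $limit;
    }
    return $excerpt;
}

sub generate_excerpts {
    my ( $self, $doc ) = @_;
    my %excerpts;
    for my $spec ( @{ $self->{specs} } ) {
        $excerpts{ $spec->{name} }
            = $self->single_excerpt( %$spec, doc => $doc );
    }
    return \%excerpts;
}

# A subclass only has to override single_excerpt.
package Sketch::ShoutingHighlighter;
our @ISA = ('Sketch::Highlighter');

sub single_excerpt {
    my ( $self, %args ) = @_;
    return uc $self->SUPER::single_excerpt(%args);
}

package main;

my $highlighter = Sketch::ShoutingHighlighter->new( searcher => undef );
$highlighter->add_spec( name => 'content', excerpt_length => 11 );
my $excerpts
    = $highlighter->generate_excerpts( { content => 'hello world, again' } );
print "$excerpts->{content}\n";
```

Since add_spec keeps only what the caller passed plus defaults, a subclass sees the same documented args the user supplied and is free to compute its own limit.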

We'll need to add one extra named arg to the add_spec list.  "hit"?   
Or actually, how about "doc"?

    # User code:
    my $highlighter = KinoSearch::Highlight::Highlighter->new(
       searcher => $searcher,
    );
    $highlighter->add_spec( name => 'content' );
    my $excerpts = $highlighter->generate_excerpts($hit);

    # Internally, highlighter calls single_excerpt:
    for my $spec ( @{ $specs{$$self} } ) {
       $excerpts->{ $spec->{name} } = $self->single_excerpt(
          %$spec,
          doc => $hit,
       );
    }

The DocVector object would be retrieved within single_excerpt() --  
which becomes possible once the Highlighter gets a Searcher at  
construction time.
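
As a rough illustration of that flow, the sketch below uses toy classes only. Demo::Searcher, its fetch_doc_vec method, the doc_num key, and the shape of the vector data are all hypothetical stand-ins chosen for the example, not the real KinoSearch API.

```perl
use strict;
use warnings;

# Stand-in searcher holding per-document vector data (hypothetical).
package Demo::Searcher;

sub new {
    my ( $class, %args ) = @_;
    return bless { doc_vecs => $args{doc_vecs} || {} }, $class;
}

# Hypothetical accessor: look up vector data by document number.
sub fetch_doc_vec {
    my ( $self, $doc_num ) = @_;
    return $self->{doc_vecs}{$doc_num};
}

# Stand-in highlighter that is handed a searcher at construction time.
package Demo::VecHighlighter;

sub new {
    my ( $class, %args ) = @_;
    return bless { searcher => $args{searcher} }, $class;
}

sub single_excerpt {
    my ( $self, %args ) = @_;
    my $doc = $args{doc};

    # The caller passes only the doc; the vector is fetched in here.
    my $doc_vec   = $self->{searcher}->fetch_doc_vec( $doc->{doc_num} );
    my $positions = $doc_vec->{ $args{name} } || [];

    return sprintf '%s [%d term positions]',
        $doc->{ $args{name} }, scalar @$positions;
}

package main;

my $searcher = Demo::Searcher->new(
    doc_vecs => { 7 => { content => [ 0, 5 ] } },
);
my $vec_highlighter = Demo::VecHighlighter->new( searcher => $searcher );
my $summary         = $vec_highlighter->single_excerpt(
    name => 'content',
    doc  => { doc_num => 7, content => 'some text' },
);
print "$summary\n";
```

The point is only that user code never touches the vector: it hands over a doc, and the highlighter uses the searcher it already holds.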

I'm a little uncertain about dedicating the name "Hit" to the class  
for the documents that Hits::fetch_hit returns.  Sure, it works, but  
"hit" is used elsewhere, e.g. the HitCollector class, which doesn't  
deal with *this* kind of "hit".  These are essentially Doc objects.   
So I'm thinking make them a subclass of Doc called HitDoc, and have  
the named arg for single_excerpt() be "doc".  Sound good?
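
For what it's worth, the relationship I have in mind is roughly the following. This is a toy sketch: the Sketch:: class names, the fields/score constructor args, and the get_field/get_score accessors are made up for illustration, and whether HitDoc carries a score at all is an assumption.

```perl
use strict;
use warnings;

# Toy base class: a document is a bag of named fields.
package Sketch::Doc;

sub new {
    my ( $class, %args ) = @_;
    return bless { fields => { %{ $args{fields} || {} } } }, $class;
}

sub get_field {
    my ( $self, $name ) = @_;
    return $self->{fields}{$name};
}

# A HitDoc is a Doc plus whatever is specific to a search result.
package Sketch::HitDoc;
our @ISA = ('Sketch::Doc');

sub new {
    my ( $class, %args ) = @_;
    my $score = delete $args{score};    # assumed hit-specific attribute
    my $self  = $class->SUPER::new(%args);
    $self->{score} = $score;
    return $self;
}

sub get_score {
    my ($self) = @_;
    return $self->{score};
}

package main;

my $hit_doc = Sketch::HitDoc->new(
    fields => { content => 'some text' },
    score  => 0.5,
);

# Anything written against Doc accepts a HitDoc unchanged.
print ref($hit_doc), ' isa Sketch::Doc: ',
    ( $hit_doc->isa('Sketch::Doc') ? 'yes' : 'no' ), "\n";
```

Because a HitDoc is-a Doc, "doc" stays an accurate name for the single_excerpt() arg whether or not the object came out of a search.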

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/
