[KinoSearch] Subclassable Highlighter

Marvin Humphrey marvin at rectangular.com
Mon Jan 28 17:27:38 PST 2008




On Jan 28, 2008, at 3:39 PM, Father Chrysostomos wrote:

> On Jan 27, 2008, at 7:56 PM, Marvin Humphrey wrote:
>
>>  my $highlighter = KinoSearch::Highlight::Highlighter->new(
>>      searcher => $searcher,
>>      query    => $query,
>>  );
>
> Another problem with this approach is that the highlighter can only  
> be used for one query. If a second search is made with the same  
> $searcher, another highlighter is needed.

True, but I can't think of where that would cause a problem.  Can you  
think of one?

In contrast, if we add a Query member to each HitDoc, that means the  
Query will have to be serialized/deserialized if we send the hit over  
the network.
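
A minimal sketch of that serialization cost, assuming Storable is the wire format and using plain hashes as hypothetical stand-ins for HitDoc and Query (neither is a real KinoSearch object here, and the field names are invented for illustration):

```perl
use strict;
use warnings;
use Storable qw( nfreeze thaw );

# Hypothetical stand-ins: plain hashes, not real KinoSearch objects.
my $query = {
    type  => 'phrase',
    field => 'content',
    terms => [qw( lincoln bedroom )],
};
my $bare_hit = { doc_num => 42, score => 3.5 };
my $fat_hit  = { %$bare_hit, query => $query };

# nfreeze() serializes in network byte order, as for a remote searcher.
my $bare_frozen = nfreeze($bare_hit);
my $fat_frozen  = nfreeze($fat_hit);

# The receiving side has to rebuild the query along with every hit.
my $copy = thaw($fat_frozen);

printf "bare hit: %d bytes, hit with query: %d bytes\n",
    length($bare_frozen), length($fat_frozen);
```

Every hit carries its own copy of the query in this arrangement, so the overhead is paid once per hit rather than once per search.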

Part of the reason that Highlighter's API looks the way it does is
that Highlighters had to do their work from inside a Hits object.
That was a kludge, necessitated by the fact that it was possible to
know the doc_num from within the Hits object (and thus possible to
fetch the relevant DocVector), but impossible to know the doc_num
from the hashref returned by $hits->fetch_hit_hashref.

Now that we're about to return a HitDoc object instead of a plain  
hashref, we're not bound by that constraint, and I'm very much  
looking forward to zapping Hits::create_excerpts.

In fact, we could simplify further.  Now that we don't have to stick  
all our excerpts into $hashref->{excerpts}, we can return the  
excerpts as scalars, one-at-a-time -- eliminating both add_spec() and  
generate_excerpts().

   my $highlighter = KinoSearch::Highlight::Highlighter->new(
     searcher       => $searcher,   # required
     query          => $query,      # required
     field          => 'content',   # required
     excerpt_length => 150,         # default: 200
     formatter      => $formatter,  # default: a SimpleHTMLFormatter
     encoder        => $encoder,    # default: a SimpleHTMLEncoder
   );
   while ( my $hit = $hits->fetch_hit ) {
      my $excerpt = $highlighter->single_excerpt($hit);
      ...
   }

Juggling how params get set is a superficial change compared with  
e.g. making single_excerpt() public, so it isn't that important.   
However, I wonder if this lighter-weight vision for a highlighter  
makes you more comfortable.  To my mind, it's OK if highlighters are  
ephemeral and you create a new one for each query.

> Unless $searcher can have a ->get_last_query method....

Yikes, that'd be asking for trouble!

> Also, when it comes to the highlight_data method, which class  
> should be responsible for removing duplicate HighlightSpans? Should  
> I make this a method of Highlighter itself?

When would there be duplicates?  I suppose you'd see the same  
positions multiple times for a query like 'lincoln "lincoln  
bedroom"', but you'd get different weights.  That query would  
probably yield two spans with data like this...

      {  start_offset => 15, end_offset => 22, weight => 1.2 }
      {  start_offset => 15, end_offset => 30, weight => 3.5 }

... with the second span having a higher weight to reflect the  
relative rarity of the phrase compared to the single term.
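
If exact duplicates did turn up, one way to collapse them is to key each span on its offset pair. The following is only a sketch: the spans are plain hashes rather than real HighlightSpan objects, dedupe_spans() is a hypothetical helper, and the third span is an invented duplicate added to exercise it.

```perl
use strict;
use warnings;

# Span data from the example above, plus one invented exact duplicate.
my @spans = (
    { start_offset => 15, end_offset => 22, weight => 1.2 },
    { start_offset => 15, end_offset => 30, weight => 3.5 },
    { start_offset => 15, end_offset => 22, weight => 1.2 },
);

# Hypothetical helper: collapse spans that share both offsets, keeping
# the highest weight seen for each offset pair.
sub dedupe_spans {
    my @in = @_;
    my %best;
    for my $span (@in) {
        my $key = "$span->{start_offset}-$span->{end_offset}";
        if ( !$best{$key} or $span->{weight} > $best{$key}{weight} ) {
            $best{$key} = $span;
        }
    }
    # Return survivors ordered by start offset, then end offset.
    return sort {
               $a->{start_offset} <=> $b->{start_offset}
            or $a->{end_offset}   <=> $b->{end_offset}
    } values %best;
}

my @unique = dedupe_spans(@spans);
printf "%d-%d weight %.1f\n", @$_{qw( start_offset end_offset weight )}
    for @unique;
```

The term span and the phrase span both survive, because they differ in end_offset even though they start at the same position. Only the invented duplicate is dropped.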

> I don’t remember whether I told you: I’m working on these changes  
> to Highlighter, and I think I will have a patch ready soon.

I'm working on the Doc class right now.  You should see some commits  
over the next few hours.

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/


