[KinoSearch] Re: KinoSearch::Highlight::Highlighter

Marvin Humphrey marvin at rectangular.com
Wed Jul 16 23:21:24 PDT 2008




On Jul 16, 2008, at 6:23 AM, Michael Greb wrote:

>>> Perhaps it may make sense to have an argument that allows you to  
>>> specify a character/string to prefer breaking on that defaults to  
>>> '\.'.

Please note that Highlighter's API has changed since the last dev  
release.

Here's Highlighter's current algorithm:

  * Hand the Query a document and ask it what sections
    of the field in question it thinks are important, if any.
    Any "hot" sections are expressed via HighlightSpan
    objects, which define a start_offset, an end_offset,
    and a floating point "weight".
  * Take all the HighlightSpan objects and create a HeatMap,
    which muxes all the spans and adds bonus heat whenever
    spans occur close together.
  * Analyze the HeatMap and find the hottest section of the
    field, using boundaries a little larger than the desired
    excerpt size.  (Right now, it's find_best_fragment() that
    does this, but it's not clear that that method needs to
    be public.)
  * Use Highlighter::find_sentence_boundaries to locate
    bounds inside and immediately outside the hot window.
  * Have Highlighter::raw_excerpt determine the formal
    boundaries of the excerpt. Use sentence boundaries when
    possible, but apply ellipses when necessary.
  * Have Highlighter::highlight_excerpt process the raw
    excerpt by applying Highlighter::highlight and
    Highlighter::encode.
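
To make the first three steps concrete, here is a rough sketch in Python rather than KinoSearch's own Perl/C. Everything in it is an assumption made for illustration: the Span class, build_heat_map(), find_best_window(), the 20-character proximity window, and the bonus formula are hypothetical stand-ins, not the actual HighlightSpan or HeatMap implementation.

```python
from dataclasses import dataclass


@dataclass
class Span:
    """Hypothetical stand-in for a HighlightSpan."""
    start: int      # start offset, inclusive
    end: int        # end offset, exclusive
    weight: float


def build_heat_map(spans, field_len, window=20, bonus=0.5):
    """Return per-character heat: summed span weights plus a proximity bonus.

    Assumed bonus rule: two non-overlapping spans separated by at most
    `window` characters each gain bonus * (mean of their weights).
    """
    heat = [0.0] * field_len

    def add(span, amount):
        for i in range(max(0, span.start), min(field_len, span.end)):
            heat[i] += amount

    for span in spans:
        add(span, span.weight)

    ordered = sorted(spans, key=lambda s: s.start)
    for i, a in enumerate(ordered):
        for b in ordered[i + 1:]:
            gap = b.start - a.end
            if gap > window:
                break               # later spans start even further away
            if gap >= 0:
                extra = bonus * (a.weight + b.weight) / 2
                add(a, extra)
                add(b, extra)
    return heat


def find_best_window(heat, width):
    """Return the start offset of the hottest window (earliest on ties)."""
    width = min(width, len(heat))
    if width == 0:
        return 0
    current = sum(heat[:width])
    best, best_start = current, 0
    for start in range(1, len(heat) - width + 1):
        # Slide the window one character to the right.
        current += heat[start + width - 1] - heat[start - 1]
        if current > best:
            best, best_start = current, start
    return best_start
```

With a width a little larger than the desired excerpt size, the offset returned by find_best_window() would serve as the start of the "hot window" that the later steps refine.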

The question right now is what the APIs should look like for  
find_sentence_boundaries() and raw_excerpt().  FWIW, they are  
surprisingly hard to implement, because grammatical inconsistencies  
are hard to avoid and there are lots of edge cases.

For starters: Right now, find_sentence_boundaries() returns an array  
of offsets marking where each sentence starts. However, this is not  
ideal; it would be better to know the exact end offsets as well.
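
As one possible shape for that, here is a Python sketch (not KinoSearch code) in which the boundary finder returns (start, end) pairs and the excerpt step consumes them. The names find_sentence_bounds() and raw_excerpt(), the naive "punctuation followed by whitespace" rule, and the "..." ellipsis are all assumptions made for illustration. The regex will happily split on abbreviations like "e.g.", which is exactly the kind of edge case mentioned above.

```python
import re

# Assumed rule: a sentence ends at a run of . ! ? followed by whitespace
# or the end of the text.
_SENTENCE_END = re.compile(r'[.!?]+(?=\s|$)')


def find_sentence_bounds(text):
    """Return a list of (start, end) offsets, one pair per sentence.

    `end` is exclusive and includes the closing punctuation; whitespace
    between sentences belongs to neither neighbour.
    """
    bounds = []
    pos, n = 0, len(text)
    while pos < n:
        while pos < n and text[pos].isspace():
            pos += 1                     # skip inter-sentence whitespace
        if pos >= n:
            break
        match = _SENTENCE_END.search(text, pos)
        # An unterminated trailing fragment runs to the end of the text.
        end = match.end() if match else len(text.rstrip())
        bounds.append((pos, end))
        pos = end
    return bounds


def raw_excerpt(text, bounds, win_start, win_end, ellipsis="..."):
    """Snap a hot window to sentence bounds, else mark cuts with ellipses."""
    # Prefer the run of sentences lying wholly inside the window.
    inside = [(s, e) for s, e in bounds if s >= win_start and e <= win_end]
    if inside:
        return text[inside[0][0]:inside[-1][1]]
    # No whole sentence fits: use the raw window and flag each cut edge.
    starts = {s for s, _ in bounds}
    ends = {e for _, e in bounds}
    prefix = "" if win_start in starts else ellipsis
    suffix = "" if win_end in ends else ellipsis
    return prefix + text[win_start:win_end].strip() + suffix
```

Having the end offsets means the excerpt can stop right after the closing punctuation without re-scanning for trailing whitespace, and the fallback can tell whether the window edge happens to coincide with a real sentence end.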

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/


_______________________________________________
KinoSearch mailing list
KinoSearch at rectangular.com
http://www.rectangular.com/mailman/listinfo/kinosearch



