[KinoSearch] Re: KinoSearch::Highlight::Highlighter
Marvin Humphrey
marvin at rectangular.com
Wed Jul 16 23:21:24 PDT 2008
On Jul 16, 2008, at 6:23 AM, Michael Greb wrote:
>>> Perhaps it may make sense to have an argument that allows you to
>>> specify a character/string to prefer breaking on that defaults to
>>> '\.'.
Please note that Highlighter's API has changed since the last dev
release.
Here's Highlighter's current algorithm:
* Hand the Query a document and ask it what sections
of the field in question it thinks are important, if any.
Any "hot" sections are expressed via HighlightSpan
objects, which define a start_offset, an end_offset,
and a floating point "weight".
* Take all the HighlightSpan objects and create a HeatMap,
which muxes all the spans plus adds bonus heat whenever
spans occur close together.
* Analyze the HeatMap and find the hottest section of the
field, using boundaries a little larger than the desired
excerpt size. (Right now, it's find_best_fragment() that
does this, but it's not clear that that method needs to
be public.)
* Use Highlighter::find_sentence_boundaries to locate
bounds inside and immediately outside the hot window.
* Have Highlighter::raw_excerpt determine the formal
boundaries of the excerpt. Use sentence boundaries when
possible, but apply ellipses when necessary.
* Have Highlighter::highlight_excerpt process the raw
excerpt by applying Highlighter::highlight and
Highlighter::encode.
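
To make the first three steps concrete, here is a rough sketch in Python
rather than in KinoSearch's own code. Everything in it is an assumption
made for illustration: the Span class, build_heat(), hottest_window(),
and the proximity and bonus values are hypothetical stand-ins for
HighlightSpan, HeatMap and find_best_fragment(), not their real
interfaces.

```python
from dataclasses import dataclass

@dataclass
class Span:
    """Hypothetical stand-in for a HighlightSpan."""
    start: int     # start_offset
    end: int       # end_offset (exclusive in this sketch)
    weight: float

def build_heat(spans, length, proximity=20, bonus=0.5):
    """Return per-character heat for a field of `length` characters.

    Each span adds its weight over its range; two neighbouring spans
    separated by at most `proximity` characters each get `bonus` extra.
    The proximity and bonus values are arbitrary illustrative choices.
    """
    heat = [0.0] * length

    def add(span, amount):
        for i in range(max(span.start, 0), min(span.end, length)):
            heat[i] += amount

    for span in spans:
        add(span, span.weight)

    ordered = sorted(spans, key=lambda s: s.start)
    for left, right in zip(ordered, ordered[1:]):
        if right.start - left.end <= proximity:
            add(left, bonus)
            add(right, bonus)
    return heat

def hottest_window(heat, width):
    """Return the start offset of the `width`-wide window with the most
    total heat, preferring the earliest window on ties."""
    if not heat:
        return 0
    width = min(width, len(heat))
    total = sum(heat[:width])
    best_total, best_start = total, 0
    for start in range(1, len(heat) - width + 1):
        # Slide one character right: add the new one, drop the old one.
        total += heat[start + width - 1] - heat[start - 1]
        if total > best_total:
            best_total, best_start = total, start
    return best_start

spans = [Span(5, 10, 1.0), Span(50, 55, 1.0), Span(60, 65, 1.0)]
heat = build_heat(spans, length=100)
window_start = hottest_window(heat, width=20)
```

With these inputs the two spans near offset 50 reinforce each other, so
the window lands around them rather than around the isolated span near
the start of the field.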
The question right now is what the APIs should look like for
find_sentence_boundaries() and raw_excerpt(). FWIW, they are
surprisingly hard to implement, because grammatical inconsistencies
are hard to avoid and there are lots of edge cases.
For starters: Right now, find_sentence_boundaries() returns an array
of offsets marking where each sentence starts. However, this is not
ideal; it would be better to know the exact end offsets as well.
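
As an illustration of what returning both offsets could look like, here
is a minimal sketch in Python. It is not the KinoSearch implementation;
sentence_bounds() and its regular expression are hypothetical, and the
rule it uses (a sentence ends at a run of ".", "!" or "?" followed by
whitespace or end of text) is a deliberately naive assumption.

```python
import re

# A sentence starts at a non-whitespace character and runs to the first
# run of terminal punctuation that is followed by whitespace or the end
# of the text. Text with no terminal punctuation runs to the end.
_SENTENCE = re.compile(r"\S.*?(?:[.!?]+(?=\s|$)|$)", re.DOTALL)

def sentence_bounds(text):
    """Return a list of (start_offset, end_offset) pairs, one per
    sentence, with end_offset exclusive."""
    return [(m.start(), m.end()) for m in _SENTENCE.finditer(text)]

bounds = sentence_bounds("Foo bar. Baz qux!  Trailing")
```

Having the end offset lets a caller trim an excerpt to exactly the
sentence text without guessing how much whitespace follows it. This
naive rule still trips over abbreviations such as "e.g." in the middle
of a sentence, which is one of the edge cases mentioned above.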
Marvin Humphrey
Rectangular Research
http://www.rectangular.com/