[KinoSearch] Subclassable Highlighter (was Re: KinoSearch feature suggestions)

Marvin Humphrey marvin at rectangular.com
Wed Jan 23 22:33:10 PST 2008




On Jan 23, 2008, at 6:59 AM, Peter Karman wrote:

> fwiw, Search::Tools offers highlighting and excerpting (snipping) via
> the building of complex regular expressions. See
> http://search.cpan.org/~karman/Search-Tools-0.16/lib/Search/Tools/Snipper.pm
> http://search.cpan.org/~karman/Search-Tools-0.16/lib/Search/Tools/HiLiter.pm
>
> The algorithm I use for snipping/excerpting is slow, and I would love
> to see how a different approach could improve performance. I believe
> the primary reason my approach is slow is that it uses a big regex.

KinoSearch's highlighter is fast because it utilizes information  
generated at index time and stored in the "term vectors" file.  Each  
"vectorized" field's data consists of...

   * Term text.
   * Each term's position in the field, measured in tokens.
   * Each position's start offset, measured in Unicode code points.
   * Each position's end offset, measured in Unicode code points.

Because the start offset and end offset are stored, it is possible to  
highlight stemmed terms accurately.  For instance, if a field starts  
off with "Horses are fast", the stemmed text "hors" is stored along  
with a start offset of 0 and an end offset of 6, allowing us to  
insert highlighting emphasis marks at those positions.  The same  
technique could be used to e.g. highlight synonyms after synonym  
analysis.
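
As a rough illustration of the offsets idea, here is a minimal sketch in
Python.  It is not KinoSearch code: the Posting structure, the toy
stemmer, and the <strong> markers are all assumptions made up for the
example.  It only shows how stored start/end offsets let us mark up the
original text even though the indexed term is a stem.

```python
import re
from dataclasses import dataclass


@dataclass
class Posting:
    """One occurrence of a term in a field (hypothetical structure)."""
    position: int  # token position within the field
    start: int     # start offset, in code points
    end: int       # end offset (exclusive), in code points


def toy_stem(token):
    """Stand-in for a real stemmer: lowercase and strip a few suffixes."""
    token = token.lower()
    for suffix in ("es", "s", "e"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token


def build_term_vectors(text):
    """Index-time step: map each stemmed term to its list of postings."""
    vectors = {}
    for position, match in enumerate(re.finditer(r"\w+", text)):
        posting = Posting(position, match.start(), match.end())
        vectors.setdefault(toy_stem(match.group()), []).append(posting)
    return vectors


def highlight(text, vectors, query_terms, pre="<strong>", post="</strong>"):
    """Search-time step: wrap every occurrence of the query terms,
    using only the stored offsets (the text is never re-analyzed)."""
    spans = sorted({
        (p.start, p.end)
        for term in query_terms
        for p in vectors.get(toy_stem(term), [])
    })
    out, cursor = [], 0
    for start, end in spans:
        out.append(text[cursor:start])
        out.append(pre + text[start:end] + post)
        cursor = end
    out.append(text[cursor:])
    return "".join(out)


text = "Horses are fast"
vectors = build_term_vectors(text)
highlighted = highlight(text, vectors, ["horse"])
```

Python string indices count code points, which is why the offsets here
line up with the description above.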

The essence of the Highlighter is that after we have a result set, we  
rerun the query against the documents one-at-a-time and see what  
parts are most important.  For this to work, we need...

   * Query/Scorer classes which are capable of telling us why they
     scored a document the way they did.  Right now, this is done via
     $query->extract_terms, but that's a crude mechanism that will not
     hold up for esoteric subclasses of Query.
   * Access to the parsed, analyzed document.
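
To make "see what parts are most important" concrete, here is one
simple way it could be done, sketched in Python.  This is an assumption
for illustration, not the algorithm KinoSearch uses: the query_terms
argument stands in for whatever $query->extract_terms would return,
hit_offsets() stands in for a term-vector lookup, and the "most hits in
a fixed-width window" scoring is deliberately naive.

```python
import re


def hit_offsets(text, query_terms):
    """Stand-in for a term-vector lookup: return (start, end) offsets of
    every token that matches one of the query terms."""
    wanted = {term.lower() for term in query_terms}
    return [
        (m.start(), m.end())
        for m in re.finditer(r"\w+", text)
        if m.group().lower() in wanted
    ]


def best_excerpt(text, hits, width=40):
    """Return (start, end) of the window of `width` code points that
    fully contains the most hits.  Windows are tried at the start of
    the text and at the start of each hit; the earliest best one wins."""
    best_start, best_count = 0, -1
    for candidate in [0] + [start for start, _ in hits]:
        count = sum(
            1 for start, end in hits
            if start >= candidate and end <= candidate + width
        )
        if count > best_count:
            best_start, best_count = candidate, count
    return best_start, min(best_start + width, len(text))


text = "Old barns creak. Fast horses race, and fast horses win. The end."
hits = hit_offsets(text, ["fast", "horses"])
start, end = best_excerpt(text, hits)
excerpt = text[start:end]
```

A Query subclass that could report its own matching spans would replace
the hit_offsets() step, which is the part that a flat list of terms
handles poorly.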

If we did not store the "term vectors" information, we would have the  
option of rerunning analysis on the fly.  Unfortunately, this doesn't  
work well if you have either large documents or costly Analyzer  
chains.  So, storing some serialized version of the parsed document  
which can be reassembled into an object quickly will remain a crucial  
facet of the KinoSearch highlighter.
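
For a sense of what "some serialized version of the parsed document"
might look like, here is a small Python sketch that packs per-field
term data into bytes and reassembles it.  The layout is invented for
the example and is not KinoSearch's term vectors file format; the
pack_field and unpack_field names are hypothetical.

```python
import struct


def pack_field(vectors):
    """Serialize {term_text: [(position, start, end), ...]} to bytes.

    Layout (little-endian unsigned 32-bit integers, made up for this
    example): term count, then for each term its UTF-8 byte length,
    posting count, the UTF-8 bytes, and one triple per posting."""
    chunks = [struct.pack("<I", len(vectors))]
    for term, postings in sorted(vectors.items()):
        encoded = term.encode("utf-8")
        chunks.append(struct.pack("<II", len(encoded), len(postings)))
        chunks.append(encoded)
        for position, start, end in postings:
            chunks.append(struct.pack("<III", position, start, end))
    return b"".join(chunks)


def unpack_field(blob):
    """Rebuild the dictionary written by pack_field()."""
    vectors, cursor = {}, 0
    (num_terms,) = struct.unpack_from("<I", blob, cursor)
    cursor += 4
    for _ in range(num_terms):
        term_len, num_postings = struct.unpack_from("<II", blob, cursor)
        cursor += 8
        term = blob[cursor:cursor + term_len].decode("utf-8")
        cursor += term_len
        postings = []
        for _ in range(num_postings):
            postings.append(struct.unpack_from("<III", blob, cursor))
            cursor += 12
        vectors[term] = postings
    return vectors


# Term data for the field "Horses are fast", with made-up stems.
original = {
    "hors": [(0, 0, 6)],
    "are": [(1, 7, 10)],
    "fast": [(2, 11, 15)],
}
blob = pack_field(original)
restored = unpack_field(blob)
```

Reading this back is a single pass over the bytes with no tokenizing or
stemming, which is the property that matters for large documents or
costly Analyzer chains.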

I wish it were realistic to perform analysis on the fly, because then  
it would not be necessary to worry about the file format of  
persistent term vector data within the index.  TermVectors probably  
won't be part of the official file spec, in order to limit the  
clutter.  However, for backwards compatibility purposes, we'll still  
be stuck with the format once it's set.

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/






