[KinoSearch] Subclassable Highlighter
Marvin Humphrey
marvin at rectangular.com
Mon Jan 28 17:27:38 PST 2008
On Jan 28, 2008, at 3:39 PM, Father Chrysostomos wrote:
> On Jan 27, 2008, at 7:56 PM, Marvin Humphrey wrote:
>
>> my $highlighter = KinoSearch::Highlight::Highlighter->new(
>>     searcher => $searcher,
>>     query    => $query,
>> );
>
> Another problem with this approach is that the highlighter can only
> be used for one query. If a second search is made with the same
> $searcher, another highlighter is needed.

True, but I can't think of a case where that would cause a problem.
Can you think of one?

In contrast, if we add a Query member to each HitDoc, the Query will
have to be serialized and deserialized whenever we send a hit over
the network.
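
To make that cost concrete, here is a rough sketch using the core
Storable module. The hashes are hypothetical stand-ins for HitDoc and
Query objects, not the real KinoSearch classes, and the field names
are invented for illustration. It only shows that embedding the query
in each hit means the query travels, and gets rebuilt, with every hit.

```perl
use strict;
use warnings;
use Storable qw( freeze thaw );

# Hypothetical stand-ins: plain hashes, not real Query/HitDoc objects.
my %query = (
    type  => 'phrase',
    field => 'content',
    terms => [ 'lincoln', 'bedroom' ],
);
my %hit_plain = (
    doc_num => 42,
    score   => 1.7,
    fields  => { title => 'White House tour' },
);

# The same hit, but carrying a reference to the query.
my %hit_with_query = ( %hit_plain, query => \%query );

# Serialize both variants as they might be before going over the wire.
my $plain      = freeze( \%hit_plain );
my $with_query = freeze( \%hit_with_query );

printf "hit alone:      %d bytes\n", length $plain;
printf "hit with query: %d bytes\n", length $with_query;

# The receiving side has to rebuild the query along with the hit.
my $thawed = thaw($with_query);
print "thawed query terms: @{ $thawed->{query}{terms} }\n";
```

With many hits per result page, that extra payload repeats once per
hit, whereas a highlighter that holds the query keeps it in one place.
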

Part of the reason Highlighter's API looks the way it does is the old
limitation that Highlighters had to do their work from inside a Hits
object. That was a kludge, necessitated by the fact that it was
possible to know the doc_num from within the Hits object (and thus
possible to fetch the relevant DocVector), but impossible to know the
doc_num from the hashref returned by $hits->fetch_hit_hashref.

Now that we're about to return a HitDoc object instead of a plain
hashref, we're not bound by that constraint, and I'm very much
looking forward to zapping Hits::create_excerpts.

In fact, we could simplify further. Now that we don't have to stick
all our excerpts into $hashref->{excerpts}, we can return the
excerpts as scalars, one at a time -- eliminating both add_spec() and
generate_excerpts().

    my $highlighter = KinoSearch::Highlight::Highlighter->new(
        searcher       => $searcher,     # required
        query          => $query,        # required
        field          => 'content',     # required
        excerpt_length => 150,           # default: 200
        formatter      => $formatter,    # default: a SimpleHTMLFormatter
        encoder        => $encoder,      # default: a SimpleHTMLEncoder
    );

    # Iterate until fetch_hit runs out of hits.
    while ( my $hit = $hits->fetch_hit ) {
        my $excerpt = $highlighter->single_excerpt($hit);
        ...
    }

Juggling how params get set is a superficial change compared with
e.g. making single_excerpt() public, so it isn't that important.
However, I wonder if this lighter-weight vision for a highlighter
makes you more comfortable. To my mind, it's OK if highlighters are
ephemeral and you create a new one for each query.

> Unless $searcher can have a ->get_last_query method....

Yikes, that'd be asking for trouble!

> Also, when it comes to the highlight_data method, which class
> should be responsible for removing duplicate HighlightSpans? Should
> I make this a method of Highlighter itself?

When would there be duplicates? I suppose you'd see the same
positions multiple times for a query like 'lincoln "lincoln
bedroom"', but you'd get different weights. That query would
probably yield two spans with data like this...

    { start_offset => 15, end_offset => 22, weight => 1.2 }
    { start_offset => 15, end_offset => 30, weight => 3.5 }

... with the second span having a higher weight to reflect the
relative rarity of the phrase compared to the single term.
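
If exact duplicates did turn up, removing them would be simple enough.
Here's a minimal sketch, assuming spans are plain hashrefs shaped like
the two above. dedupe_spans() is a hypothetical helper, not an
existing KinoSearch method. It treats two spans as duplicates only
when both offsets match, and keeps the heavier one.

```perl
use strict;
use warnings;

# Hypothetical helper: collapse spans that share both start_offset and
# end_offset, keeping whichever has the higher weight.
sub dedupe_spans {
    my @spans = @_;
    my %best;
    for my $span (@spans) {
        my $key = "$span->{start_offset}-$span->{end_offset}";
        if ( !$best{$key} || $span->{weight} > $best{$key}{weight} ) {
            $best{$key} = $span;
        }
    }

    # Return survivors ordered by start offset, then end offset.
    return sort {
               $a->{start_offset} <=> $b->{start_offset}
            || $a->{end_offset}   <=> $b->{end_offset}
    } values %best;
}

my @spans = (
    { start_offset => 15, end_offset => 22, weight => 1.2 },
    { start_offset => 15, end_offset => 30, weight => 3.5 },
    { start_offset => 15, end_offset => 22, weight => 1.2 },  # exact repeat
);

my @unique = dedupe_spans(@spans);
printf "%d-%d weight %.1f\n", @{$_}{qw( start_offset end_offset weight )}
    for @unique;
```

The term span and the phrase span both survive, since they overlap but
are not identical. Only the exact repeat is dropped.
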
> I don’t remember whether I told you: I’m working on these changes
> to Highlighter, and I think I will have a patch ready soon.

I'm working on the Doc class right now. You should see some commits
over the next few hours.

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/