[KinoSearch] German stemmer chokes on words ending in umlaut

Marvin Humphrey marvin at rectangular.com
Sat Jul 4 14:00:14 PDT 2009


On Sat, Jul 04, 2009 at 09:30:56PM +0200, Nick Wellnhofer wrote:
> 
> It seems that line 61 of Stemmer.c should be
> 
>         memcpy(token->text, stemmed_text, len + 1);
> 
> instead of
> 
>         memcpy(stemmed_text, token->text, len + 1);

Wow.  How annoying that it worked at all as it was. :(

The bug was hidden because Stemmer was properly resetting the length of the
tokens -- it was effectively functioning as a string truncator.  And in many
cases, truncation and stemming turn out to be equivalent -- such as the
"senator senate" example from the tutorial, and the "peas porridge hot" test
from t/156-stemmer.t.  
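
To see why the reversed memcpy() went unnoticed, here is a minimal,
self-contained sketch.  DemoToken, demo_token_set(), apply_stem_buggy() and
apply_stem_fixed() are hypothetical names made up for illustration, not the
actual code in Stemmer.c, and the stems passed in are assumed example values
rather than output captured from the stemmer.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a token: a length-delimited text buffer.
 * This is NOT KinoSearch's real Token struct. */
typedef struct {
    char   text[32];
    size_t len;
} DemoToken;

/* Helper (hypothetical): load a word into the demo token. */
static void
demo_token_set(DemoToken *token, const char *word)
{
    size_t len = strlen(word);
    assert(len < sizeof(token->text));
    memcpy(token->text, word, len + 1);
    token->len = len;
}

/* Reversed copy: the token's bytes overwrite the stem buffer, so the
 * token text is never changed.  Only the length reset takes effect,
 * which makes this a plain truncation to `len` bytes.
 * `stemmed_text` must be writable and hold at least len + 1 bytes. */
static void
apply_stem_buggy(DemoToken *token, char *stemmed_text, size_t len)
{
    assert(len < sizeof(token->text));
    memcpy(stemmed_text, token->text, len + 1);
    token->len = len;
}

/* Corrected copy: the stem's bytes overwrite the token text. */
static void
apply_stem_fixed(DemoToken *token, const char *stemmed_text, size_t len)
{
    assert(len < sizeof(token->text));
    memcpy(token->text, stemmed_text, len + 1);
    token->len = len;
}
```

With the token "senator" and an assumed stem of "senat", both functions leave
the same first five bytes in the token, so the bug is invisible.  With "happy"
and an assumed stem of "happi" (the kind of y-to-i rewrite that Porter-style
stemmers make), the reversed copy leaves "happy" in place while the corrected
copy yields "happi".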

Good bug hunting, Nick.  Thanks for the fix; I'll make a new release.

Users who have deployed Stemmer, either directly or via PolyAnalyzer, will
need to regenerate their indexes after installing the bugfix.  However, the
immediate effect of searching an old index with the fixed code will be
degraded search results rather than crashes, which may allow some people a
bit of flexibility in how they go about reindexing.

This seems like a good time to mention the "truncate" flag to Indexer's
constructor, which was introduced in 0.30_01.  It allows you to overwrite an
existing index, but the old index data doesn't get zapped until commit()
completes successfully.

Marvin Humphrey



