[KinoSearch] German stemmer chokes on words ending in umlaut
Nick Wellnhofer
wellnhofer at aevum.de
Sat Jul 4 11:47:03 PDT 2009
This is probably a bug in Lingua::Stem::Snowball. I hope it's okay to post this here.
If I run the attached script with KinoSearch 0.30_03, I get the following error message:
Invalid UTF-8, aborting: 'xxxxxxx▒'
Invalid UTF-8. at core/KinoSearch/Util/CharBuf.c:168 S_die_invalid_utf8
at stem_test.pl line 29
The invalid UTF-8 sequence is C3 27. This only happens when the German stemmer is applied to words ending in an umlaut (ä, ö, ü). Words with an umlaut in the middle work fine.
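For reference, here is a minimal Python sketch of why C3 27 is rejected. It is not the attached stem_test.pl and does not call KinoSearch or Lingua::Stem::Snowball; the helper name is_valid_utf8 is made up for illustration. In UTF-8, ä, ö and ü all encode as two bytes starting with C3, and the second byte must be a continuation byte in the range 80..BF. 27 is an ASCII apostrophe, so the sequence is consistent with the umlaut's second byte having been lost, although the sketch alone does not show where that happens.

```python
# Bytes reported in the error message: 0xC3 followed by 0x27 (ASCII apostrophe).
broken = bytes([0xC3, 0x27])


def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes as strict UTF-8 (hypothetical helper)."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# Intact UTF-8 encodings of the three umlauts, for comparison.
umlauts = {ch: ch.encode("utf-8") for ch in "äöü"}

if __name__ == "__main__":
    for ch, enc in umlauts.items():
        hex_bytes = " ".join(f"{b:02X}" for b in enc)
        print(ch, hex_bytes, "valid" if is_valid_utf8(enc) else "invalid")
    print("C3 27", "valid" if is_valid_utf8(broken) else "invalid")
```

Each umlaut should show C3 as its first byte and decode cleanly, while C3 27 should be reported as invalid.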
Nick
--
aevum gmbh
rumfordstr. 4
80469 münchen
germany
tel: +49 89 3838 0653
http://aevum.de/
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: stem_test.pl
Url: http://rectangular.com/pipermail/kinosearch/attachments/20090704/c66c44c9/attachment.txt