[KinoSearch] German stemmer chokes on words ending in umlaut

Nick Wellnhofer wellnhofer at aevum.de
Sat Jul 4 11:47:03 PDT 2009


This is probably a bug in Lingua::Stem::Snowball. I hope it's okay to post this here.

If I run the attached script using KinoSearch 0.30_03, I get the following error message:

Invalid UTF-8, aborting: 'xxxxxxx▒'
Invalid UTF-8. at core/KinoSearch/Util/CharBuf.c:168 S_die_invalid_utf8
         at stem_test.pl line 29

The invalid UTF-8 sequence is C3 27. This only happens when using the German stemmer on words ending in an umlaut (ä, ö, ü). Words with an umlaut in the middle work fine.
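
For reference, here is a minimal sketch of why C3 27 cannot be valid UTF-8. It is written in Python rather than Perl purely as an illustration, and the helper name is hypothetical. C3 is the lead byte of a two-byte sequence and must be followed by a continuation byte in the range 80-BF, whereas 27 is the ASCII apostrophe. The umlauts ä, ö and ü encode as C3 A4, C3 B6 and C3 BC, so the bytes would be consistent with a trailing umlaut losing its second byte, but that is only an assumption and the error message does not confirm it.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes cleanly as UTF-8 (hypothetical helper)."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


def hex_bytes(data: bytes) -> str:
    """Format bytes as space-separated upper-case hex, e.g. 'C3 BC'."""
    return " ".join(f"{b:02X}" for b in data)


# The two-byte UTF-8 encodings of the German umlauts all start with 0xC3.
umlaut_bytes = {ch: ch.encode("utf-8") for ch in "äöü"}

# The sequence from the error message: lead byte 0xC3 followed by 0x27 (').
bad_sequence = bytes([0xC3, 0x27])

if __name__ == "__main__":
    for ch, encoded in umlaut_bytes.items():
        print(ch, hex_bytes(encoded), is_valid_utf8(encoded))
    print(hex_bytes(bad_sequence), is_valid_utf8(bad_sequence))
```

The check only shows that the reported byte pair is malformed; it says nothing about where in the stemmer or in KinoSearch the second byte goes missing.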

Nick


-- 
aevum gmbh
rumfordstr. 4
80469 münchen
germany

tel: +49 89 3838 0653
http://aevum.de/
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: stem_test.pl
Url: http://rectangular.com/pipermail/kinosearch/attachments/20090704/c66c44c9/attachment.txt 

