[KinoSearch] multilanguage indexing and search
Hugues de Mazancourt
hugues at mazancourt.net
Sun Dec 3 09:24:51 PST 2006
On Dec 2, 2006, at 22:23, Alex Aver wrote:
> 2006/12/1, Marvin Humphrey <marvin at rectangular.com>:
>>
>> On Dec 1, 2006, at 8:09 AM, Alex Aver wrote:
>>
> [...]
> Why I can't use simple $word_char_tokenizer for this set of languages?
>
> Universal stemmer for mixed texts it's problem. I can separate words
> in latin & cyrillic characters and use special stemmer for Russian
> words. But how can I separate English & French?
You don't necessarily need to. About 80% of what an English stemmer does is
remove "s"/"es" at the end of a word, which also works fine for
French. The other rules (such as s/ed$//) won't hurt, because they
don't match French words.
You can also add some French rules to your stemmer, such as
s/aux$/al/, which won't have any effect on English words.
In fact, the most important thing is that you use the *same* stemmer
for indexing and querying, whatever stemming it performs.
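To illustrate the point about sharing one stemmer, here is a minimal sketch in Python (the same substitutions port directly to Perl). It is not KinoSearch code: the rule list, the length guard, and the names stem, build_index and search are assumptions made up for this example, and the rules are only the ones mentioned above.

```python
import re

# Hypothetical rule list: only the suffix rules mentioned in this message.
# The first rule that matches wins.
RULES = [
    (re.compile(r"aux$"), "al"),  # French plural: journaux -> journal
    (re.compile(r"es$"), ""),     # plural "es": boxes -> box
    (re.compile(r"s$"), ""),      # plural "s": dogs -> dog, maisons -> maison
    (re.compile(r"ed$"), ""),     # English past tense: walked -> walk
]

def stem(word):
    """Crude shared stemmer for mixed English/French text."""
    word = word.lower()
    if len(word) <= 3:            # assumption: leave very short words alone
        return word
    for pattern, replacement in RULES:
        if pattern.search(word):
            return pattern.sub(replacement, word)
    return word

def build_index(docs):
    """Map each stemmed token to the set of document ids containing it."""
    index = {}
    for doc_id, text in docs.items():
        for token in re.findall(r"\w+", text):
            index.setdefault(stem(token), set()).add(doc_id)
    return index

def search(index, query):
    """Return ids of documents containing every stemmed query token."""
    results = None
    for token in re.findall(r"\w+", query):
        hits = index.get(stem(token), set())
        results = hits if results is None else results & hits
    return results or set()

docs = {
    1: "Two dogs walked home",
    2: "Les journaux du matin",
}
index = build_index(docs)
```

Because stem() is applied both when the index is built and when the query is parsed, a search for "dog" or "journal" finds the documents containing "dogs" or "journaux". The stems do not have to be linguistically correct, only consistent on both sides.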
>
>> Tokenizing Japanese is really, really hard
>> anyway, and KinoSearch provides no native support for it.
>
> Yes, tokenizing Japanese is hard, but possible - afair dpsearch &
> mnogosearch can do index and search in Japanese. But it isn't critical
> point at this moment ;)
mnoGoSearch uses ChaSen, a free Japanese parser that has a Perl
front-end. See
http://rpmfind.net/linux/RPM/suse/9.3/i386/suse/i586/perl-Text-ChaSen-2.3.3-97.i586.html
More generally, there are some pointers on analyzing Japanese here:
http://cl.naist.jp/~eric-n/ubuntu-nlp/dists/hoary/japanese/
Best,
Hugues