[KinoSearch] multilanguage indexing and search

Hugues de Mazancourt hugues at mazancourt.net
Sun Dec 3 09:24:51 PST 2006


On 2 Dec 2006, at 22:23, Alex Aver wrote:

> 2006/12/1, Marvin Humphrey <marvin at rectangular.com>:
>>
>> On Dec 1, 2006, at 8:09 AM, Alex Aver wrote:
>>
> [...]
> Why can't I use a simple $word_char_tokenizer for this set of languages?
>
> A universal stemmer for mixed texts is the problem. I can separate words
> in Latin and Cyrillic characters and use a special stemmer for Russian
> words. But how can I separate English and French?

You don't necessarily need to. 80% of the job an English stemmer does is
to remove "s"/"es" at the end of a word, which also works fine for
French. The other rules (such as s/ed$//) won't hurt because they
don't match French words.
You can also add some French rules to your stemmer, such as s/aux$/al/,
which won't have any effect on English words.
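
As a rough illustration of such a shared rule set, here is a minimal
sketch in Python rather than Perl. The function name naive_stem, the rule
order, and the minimum stem length are assumptions made for the example;
this is not how KinoSearch or any real stemmer works, only the
substitutions mentioned above applied in sequence.

```python
import re

# Ordered (pattern, replacement) rules; the first rule that changes the
# word wins. These are only the toy rules from the message, not a real
# stemming algorithm, and they over-strip some words ("tables" -> "tabl").
RULES = [
    (re.compile(r"aux$"), "al"),  # French rule: "chevaux" -> "cheval"
    (re.compile(r"ed$"), ""),     # English rule: "walked" -> "walk"
    (re.compile(r"es$"), ""),     # shared plural rule: "boxes" -> "box"
    (re.compile(r"s$"), ""),      # shared plural rule: "cats", "chats"
]

MIN_STEM = 3  # assumption: never shrink a word below three letters


def naive_stem(word):
    """Apply the first matching rule to a lowercased word."""
    word = word.lower()
    for pattern, replacement in RULES:
        stem = pattern.sub(replacement, word)
        if stem != word and len(stem) >= MIN_STEM:
            return stem
    return word


for w in ["chevaux", "walked", "chats", "cats", "boxes"]:
    print(w, "->", naive_stem(w))
```

The French and English rules sit in one list and no language detection is
needed; a word simply passes through untouched when no rule applies.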

In fact, the most important thing is that you use the *same* stemmer
for indexing and querying, whatever stemming it performs.
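
To see why this matters, here is a small sketch of a toy inverted index
in Python. The names toy_stem, build_index and search are hypothetical
and have nothing to do with the KinoSearch API; the stemmer is a
deliberately crude stand-in.

```python
from collections import defaultdict


def toy_stem(word):
    # Stand-in for any stemmer: lowercase, then drop a trailing "s"
    # from words longer than three letters.
    word = word.lower()
    return word[:-1] if word.endswith("s") and len(word) > 3 else word


def build_index(docs, stem):
    """Map each stemmed token to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.split():
            index[stem(token)].add(doc_id)
    return index


def search(index, query, stem):
    """Return ids of documents matching every stemmed query token."""
    result = None
    for token in query.split():
        hits = index.get(stem(token), set())
        result = hits if result is None else result & hits
    return result or set()


docs = {1: "black cats", 2: "les chats noirs", 3: "a cat"}
index = build_index(docs, toy_stem)

print(search(index, "cats", toy_stem))   # same stemmer on both sides
print(search(index, "cats", str.lower))  # query side skips the stemming
```

With toy_stem on both sides, the query "cats" is reduced to "cat" and
finds the documents indexed under that stem. If the query side only
lowercases, it looks up "cats", which was never stored in the index, and
finds nothing.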

>
>> Tokenizing Japanese is really, really hard
>> anyway, and KinoSearch provides no native support for it.
>
> Yes, tokenizing Japanese is hard, but possible - afair dpsearch &
> mnogosearch can do index and search in Japanese. But it isn't critical
> point at this moment ;)

mnoGoSearch uses ChaSen, a free Japanese morphological analyzer that has
a Perl front-end. See
http://rpmfind.net/linux/RPM/suse/9.3/i386/suse/i586/perl-Text-ChaSen-2.3.3-97.i586.html
More generally, there are some pointers on analyzing Japanese here:
http://cl.naist.jp/~eric-n/ubuntu-nlp/dists/hoary/japanese/

Best,

Hugues



