[KinoSearch] Write a custom analyzer/tokenizer

Peter Karman peter at peknet.com
Sat Nov 17 18:32:22 PST 2007



This has been sitting in my inbox because I never saw a reply.

Marvin, any tips?

Gea-Suan Lin wrote on 7/28/07 5:25 AM:
> Hello all,
> 
> I want to write a custom analyzer/tokenizer for CJK UTF-8 strings in
> KinoSearch, but I don't know how.
> 
> In fact, I have already written one for Plucene:
> 
> http://search.cpan.org/dist/Plucene-Analysis-UTF8/
> http://code.google.com/p/plucene-analysis-utf8/
> 
> The algorithm is very simple. When a string has the UTF-8 flag on, we can
> use a regular expression to split it into tokens, and then generate a
> list of unigrams and bigrams:
> 
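>     my %tok;      # set of unigram and bigram tokens
>     my $c = '';   # previous token, used to build bigrams
>     while ($text =~ /([a-z\d]+|\S)/g) {
>         my $t = $1;   # copy before the next match touches $1
>         # skip punctuation and separators
>         next if $t =~ /\p{P}|\p{Z}/;
>         $tok{$t} = 1;        # unigram
>         $tok{$c . $t} = 1;   # bigram with the previous token
>         $c = $t;
>     }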
> 
> Then keys %tok will be the list of tokens.
> 
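
For anyone who wants to try the quoted approach on its own, here is a minimal standalone sketch of the same unigram/bigram idea. It is plain Perl and does not use any KinoSearch or Plucene API; the function name cjk_ngrams is hypothetical, and I am assuming the input is already a decoded character string that the caller has lowercased. Like the quoted loop, it does not reset the previous token when punctuation is skipped, so bigrams can span punctuation.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of KinoSearch or Plucene.
# Assumes $text is a decoded (character) string, already lowercased.
sub cjk_ngrams {
    my ($text) = @_;
    my %tok;          # set of tokens seen so far
    my $prev = '';    # previous token, used to build bigrams

    # Runs of ASCII letters/digits form one token; any other
    # non-whitespace character (e.g. a CJK ideograph) is its own token.
    while ($text =~ /([a-z\d]+|\S)/g) {
        my $t = $1;
        next if $t =~ /\p{P}|\p{Z}/;    # skip punctuation and separators
        $tok{$t} = 1;                               # unigram
        $tok{$prev . $t} = 1 if length $prev;       # bigram
        $prev = $t;
    }

    my @tokens = sort keys %tok;
    return @tokens;
}

# Sample input: "perl" followed by two CJK characters (U+641C U+5C0B).
binmode STDOUT, ':encoding(UTF-8)';
print join(' ', cjk_ngrams("perl \x{641C}\x{5C0B}")), "\n";
```

For that sample input the token set is the three unigrams ("perl" and the two ideographs) plus the two bigrams formed from adjacent tokens. Wiring something like this into a KinoSearch analyzer is a separate step that this sketch does not attempt.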

-- 
Peter Karman  .  http://peknet.com/  .  peter at peknet.com

_______________________________________________
KinoSearch mailing list
KinoSearch at rectangular.com
http://www.rectangular.com/mailman/listinfo/kinosearch



