[KinoSearch] Write a custom analyzer/tokenizer
Peter Karman
peter at peknet.com
Sat Nov 17 18:32:22 PST 2007
This has been sitting in my inbox because I never saw a reply.
Marvin, any tips?
Gea-Suan Lin wrote on 7/28/07 5:25 AM:
> Hello all,
>
> I want to write a custom analyzer/tokenizer for CJK UTF-8 strings in
> KinoSearch, but I don't know how.
>
> In fact, I already wrote one for Plucene:
>
> http://search.cpan.org/dist/Plucene-Analysis-UTF8/
> http://code.google.com/p/plucene-analysis-utf8/
>
> The algorithm is very simple. When a string has the UTF-8 flag on, we
> can use a regular expression to extract tokens from it, and then
> generate a list of unigrams and bigrams:
>
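> my %tok;       # set of distinct tokens
> my $c = '';    # previous token, used to build bigrams
> while ($text =~ /([a-z\d]+|\S)/go) {
>     # skip punctuation and separator characters
>     next if $1 =~ /\p{P}|\p{Z}/o;
>     $tok{$1} = 1;         # unigram
>     $tok{$c . $1} = 1;    # bigram: previous token + current token
>     $c = $1;
> }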
>
> Then keys %tok will be the list.
>
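
For anyone who wants to try the quoted loop outside Plucene, here is a
minimal, self-contained sketch of the same unigram/bigram idea. The
function name cjk_ngrams is hypothetical; it is not part of Plucene or
KinoSearch, and hooking it into a KinoSearch analyzer is not shown. It
assumes $text is already a decoded character string.

```perl
use strict;
use warnings;

# Hypothetical helper (not a Plucene or KinoSearch API): returns the
# sorted, distinct unigrams and bigrams of a decoded character string.
sub cjk_ngrams {
    my ($text) = @_;
    my %tok;
    my $prev = '';    # previous token, used to build bigrams

    # A token is a run of lowercase ASCII letters/digits, or any single
    # non-whitespace character (e.g. one CJK ideograph).
    while ( $text =~ /([a-z\d]+|\S)/g ) {
        my $cur = $1;    # copy $1 before running another match

        # As in the quoted loop, punctuation is skipped without
        # resetting $prev, so a bigram can span a punctuation mark.
        next if $cur =~ /\p{P}|\p{Z}/;

        $tok{$cur} = 1;                              # unigram
        $tok{ $prev . $cur } = 1 if length $prev;    # bigram
        $prev = $cur;
    }
    return sort keys %tok;
}

# Usage: U+4E2D U+6587 followed by an ASCII word.
binmode STDOUT, ':encoding(UTF-8)';
print join( ' ', cjk_ngrams("\x{4e2d}\x{6587} ab12") ), "\n";
```

For that input the tokens are the two ideographs, the bigram of both,
"ab12", and the bigram of the second ideograph with "ab12". Uppercase
ASCII letters fall through to the single-character branch, so
lowercasing the input first is probably what you want.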
--
Peter Karman . http://peknet.com/ . peter at peknet.com
_______________________________________________
KinoSearch mailing list
KinoSearch at rectangular.com
http://www.rectangular.com/mailman/listinfo/kinosearch