[KinoSearch] Write a custom analyzer/tokenizer
Gea-Suan Lin
gslin at gslin.org
Sat Jul 28 03:25:32 PDT 2007
Hello all,
I want to write a custom analyzer/tokenizer for CJK UTF-8 strings in
KinoSearch, but I don't know how.
In fact, I have already written one for Plucene:
http://search.cpan.org/dist/Plucene-Analysis-UTF8/
http://code.google.com/p/plucene-analysis-utf8/
The algorithm is very simple. When a string has the UTF-8 flag on, we can
use a regular expression to extract tokens from it, and then generate a
list of unigrams and bigrams:
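    my %tok;
    my $c = '';    # previous token, used to build bigrams
    while ($text =~ /([a-z\d]+|\S)/g) {
        my $t = $1;    # a run of a-z/digits, or one non-space character
        next if $t =~ /\p{P}|\p{Z}/;    # skip punctuation and separators
        $tok{$t}      = 1;    # unigram
        $tok{$c . $t} = 1;    # bigram with the previous token
        $c = $t;
    }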
After the loop, keys %tok is the token list.
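
To make the idea easier to try out, here is a minimal standalone sketch of
the same loop wrapped in a function. The name ngram_tokens is hypothetical
and is not part of Plucene or KinoSearch. The sketch assumes the input
arrives as raw UTF-8 bytes and decodes it with Encode first, so that the
regular expression sees characters rather than bytes:

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper (not a Plucene or KinoSearch API): takes UTF-8
# encoded bytes and returns the sorted unigram and bigram tokens.
sub ngram_tokens {
    my ($bytes) = @_;
    my $text = decode('UTF-8', $bytes);    # now a character string
    my %tok;
    my $prev = '';
    while ($text =~ /([a-z\d]+|\S)/g) {
        my $t = $1;
        next if $t =~ /\p{P}|\p{Z}/;    # skip punctuation and separators
        $tok{$t}         = 1;           # unigram
        $tok{$prev . $t} = 1;           # bigram (equals the unigram for the first token)
        $prev = $t;
    }
    return sort keys %tok;
}

# UTF-8 bytes for "perl" followed by U+641C U+5C0B
my $bytes = "perl \xE6\x90\x9C\xE5\xB0\x8B";
print encode('UTF-8', join(' ', ngram_tokens($bytes))), "\n";
```

Note that, as in the loop above, the previous token is not reset when
punctuation is skipped, so a bigram can span a punctuation mark.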
--
* Gea-Suan Lin (public key: Using https://keyserver.pgp.com/ to search)
* If you cannot convince them, confuse them. -- Harry S Truman