[KinoSearch] Write a custom analyzer/tokenizer

Gea-Suan Lin gslin at gslin.org
Sat Jul 28 03:25:32 PDT 2007


Hello all,

I want to write a custom analyzer/tokenizer for CJK UTF-8 strings in
KinoSearch, but I don't know how.

In fact, I have already written one for Plucene:

http://search.cpan.org/dist/Plucene-Analysis-UTF8/
http://code.google.com/p/plucene-analysis-utf8/

The algorithm is very simple. When a string has the UTF-8 flag on, we
can use a regular expression to extract tokens from it, and then
generate a list of unigrams and bigrams:

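    my %tok;      # set of unigrams and bigrams
    my $c = '';   # previous token
    while ($text =~ /([a-z\d]+|\S)/g) {
        my $t = $1;
        next if $t =~ /\p{P}|\p{Z}/;   # skip punctuation and separators
        $tok{$t} = 1;                  # unigram
        $tok{$c . $t} = 1;             # bigram with the previous token
        $c = $t;
    }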

Then keys %tok gives the token list.

-- 
* Gea-Suan Lin  (public key: Using https://keyserver.pgp.com/ to search)
* If you cannot convince them, confuse them.           -- Harry S Truman


