[KinoSearch] Write a custom analyzer/tokenizer
Marvin Humphrey
marvin at rectangular.com
Mon Nov 19 10:21:20 PST 2007
On Nov 17, 2007, at 6:32 PM, Peter Karman wrote:
> This has been sitting in my inbox because I never saw a reply.
>
> Marvin, any tips?
>
> Gea-Suan Lin wrote on 7/28/07 5:25 AM:
>> Hello all,
>>
>> I want to write a custom analyzer/tokenizer for CJK UTF-8 string in
>> KinoSearch, but I don't know how.
KinoSearch introduced proper UTF-8 support with version 0.20_01. The
current stable release won't work -- we definitely need the devel
branch.
However, the API for subclassing Analyzer is likely to change.
I'm currently working hard on KinoSearch's underlying OO model, and
this is likely to affect the details of subclassing Analyzer. I'll
have more to say in a bit, but this is the gist:
* KinoSearch::Util::Class and KinoSearch::Util::Obj will be merged
into a new public class, KinoSearch::Obj, which will serve as a
common base and will have documentation on subclassing.
* _ALL_ KS classes will be re-implemented at the Perl level using
the inside-out pattern, surrounding a C struct core. In other
words, they will all look like MockScorer does now.
This is being done to make it easier to send both objects and method
calls across the Perl-C boundary in KS. I think I'll be done in a
few days.
>> In fact I already write one for Plucene:
>>
>> http://search.cpan.org/dist/Plucene-Analysis-UTF8/
>> http://code.google.com/p/plucene-analysis-utf8/
Two notes.
First, "Plucene::Analysis::UTF8" is not a good choice of namespace
for a CJK Tokenizer. It doesn't advertise CJK, and "UTF8" is hardly
the exclusive province of CJK languages.
Second, prior to the introduction of Plucene::Analysis::UTF8, there
was already another module available on CPAN serving the same need:
Plucene::Analysis::CJKTokenizer. While I certainly understand that
sometimes you have to start fresh, I'm curious why it was necessary
in this case.
>> The algorithm is very simple. When a string with UTF-8 flag on, we
>> can use regular expression to extract it, and then generate unigram
>> and bigram list:
>>
>>     my $c = '';
>>     while ($text =~ /([a-z\d]+|\S)/go) {
>>         next if $1 =~ /\p{P}|\p{Z}/o;
>>         $tok{$1} = 1;
>>         $tok{$c . $1} = 1;
>>         $c = $1;
>>     }
That code is somewhat hard to grok because it is so terse. Here's a
key:
    \p{P}  =>  \p{Punctuation}
    \p{Z}  =>  \p{Separator}
    $c     =>  $last_unigram
Also, I don't believe the /o modifier does anything if there are
no variables being interpolated into the pattern.
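To illustrate the point about /o, here is a small standalone sketch (not taken from the quoted module) showing that the modifier only changes behavior when a variable is interpolated: the pattern is built once, so later values of the variable are ignored. The variable names are made up for the example.

```perl
use strict;
use warnings;

my ( @with_o, @without_o );
for my $word (qw(foo bar)) {
    # With /o, $word is interpolated only the first time this match runs,
    # so the second pass still searches for "foo".
    push @with_o, "foo bar" =~ /($word)/o ? $1 : 'no match';

    # Without /o, the pattern follows the current value of $word.
    push @without_o, "foo bar" =~ /($word)/ ? $1 : 'no match';
}

print "with /o:    @with_o\n";       # foo foo
print "without /o: @without_o\n";    # foo bar
```

A pattern with no interpolated variables, like the two in the quoted loop, is already compiled once, so /o has nothing left to do there.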
Lastly, I'm not sure that performing a regex against $1 like this is
reliable across all versions of Perl:
    next if $1 =~ /\p{P}|\p{Z}/o;
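One way to sidestep the $1 question is to copy the capture into a lexical before running the second match. The following is a minimal standalone sketch of the quoted unigram/bigram loop written that way, with the longer names from the key above. The function name unigrams_and_bigrams is hypothetical, and like the original it assumes the text is a lowercased character string and lets bigrams span whitespace and skipped punctuation.

```perl
use strict;
use warnings;

# Hypothetical helper: returns a hash ref whose keys are the unigrams
# and bigrams found in $text.
sub unigrams_and_bigrams {
    my ($text) = @_;
    my %tokens;
    my $last_unigram = '';
    while ( $text =~ /([a-z\d]+|\S)/g ) {
        my $unigram = $1;    # copy before running another match
        next if $unigram =~ /\p{Punctuation}|\p{Separator}/;
        $tokens{$unigram} = 1;
        $tokens{ $last_unigram . $unigram } = 1;
        $last_unigram = $unigram;
    }
    return \%tokens;
}

binmode STDOUT, ':encoding(UTF-8)';

# Three CJK characters followed by an ASCII word.
my $tokens = unigrams_and_bigrams("\x{65E5}\x{672C}\x{8A9E} perl");
print join( ' ', sort keys %$tokens ), "\n";
```

Because $last_unigram starts out empty, the first "bigram" is just the first unigram again, which is harmless when the tokens are collected in a hash.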
Translating that algo into the current Analyzer idiom might look
something like this:
    sub analyze_batch {
        my ( $self, $batch ) = @_;
        my $new_batch = KinoSearch::Analysis::TokenBatch->new;
        my $last_unigram;    # undef until the first unigram has been seen
        my $last_start_offset;

        while ( my $token = $batch->next ) {
            for ( $token->get_text ) {
                while (/([a-z\d]+|\S)/g) {
                    # Copy the capture and offsets before running
                    # another regex, rather than matching against $1.
                    my $unigram      = $1;
                    my $start_offset = $-[0];
                    my $end_offset   = $+[0];
                    next if $unigram =~ /\p{Punctuation}|\p{Separator}/;

                    # Bigram token.
                    if ( defined $last_unigram ) {
                        my $new_bigram_token = KinoSearch::Analysis::Token->new(
                            text         => "$last_unigram$unigram",
                            start_offset => $last_start_offset,
                            end_offset   => $end_offset,
                            pos_inc      => 0,
                        );
                        $new_batch->append($new_bigram_token);
                    }

                    # Unigram token.
                    my $new_unigram_token = KinoSearch::Analysis::Token->new(
                        text         => $unigram,
                        start_offset => $start_offset,
                        end_offset   => $end_offset,
                    );
                    $new_batch->append($new_unigram_token);

                    # Seed for next loop iteration.
                    $last_unigram      = $unigram;
                    $last_start_offset = $start_offset;
                }
            }
        }
        return $new_batch;
    }
However, that's going to be horribly slow. The same algorithm can be
implemented MUCH faster using XS. Tokenizing is the biggest index-
time bottleneck, so it's important to be both correct and efficient.
I tried to think of a way to implement this algo using the current
Tokenizer code in conjunction with regex lookahead/lookbehind, but I
couldn't come up with anything.
Maybe we could put together something like
KSx::Analysis::NGramTokenizer?
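As a rough idea of the core such a module would need, here is a pure-Perl sketch that generalizes the unigram/bigram pairing to any n-gram range. The ngrams function and its arguments are hypothetical and no KSx::Analysis::NGramTokenizer API is implied. It works on a list of units that have already been split, and leaves offsets and position increments aside.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of KinoSearch: build every n-gram of
# $min..$max consecutive units from an array ref of units.
sub ngrams {
    my ( $units, $min, $max ) = @_;
    my @grams;
    for my $start ( 0 .. $#$units ) {
        for my $n ( $min .. $max ) {
            my $end = $start + $n - 1;
            last if $end > $#$units;    # not enough units left
            push @grams, join( '', @{$units}[ $start .. $end ] );
        }
    }
    return @grams;
}

my @grams = ngrams( [qw(a b c)], 1, 2 );
print "@grams\n";    # a ab b bc c
```

Calling it with a range of 1 to 2 reproduces the unigram-plus-bigram scheme discussed above. A real implementation would presumably do this in XS for speed.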
Marvin Humphrey
Rectangular Research
http://www.rectangular.com/