[KinoSearch] Write a custom analyzer/tokenizer
☼ 林永忠 ☼ (Yung-chung Lin)
henearkrxern at gmail.com
Mon Mar 10 21:42:39 PDT 2008
Hi,
I wrote a CJK tokenizer in C++. Please see here:
http://code.google.com/p/cjk-tokenizer
Hope it will be useful to people who need to index CJK texts.
Best,
Yung-chung Lin
On Tue, Nov 20, 2007 at 2:21 AM, Marvin Humphrey <marvin at rectangular.com> wrote:
>
> On Nov 17, 2007, at 6:32 PM, Peter Karman wrote:
>
> > This has been sitting in my inbox because I never saw a reply.
> >
> > Marvin, any tips?
> >
> > Gea-Suan Lin wrote on 7/28/07 5:25 AM:
> >> Hello all,
> >>
> >> I want to write a custom analyzer/tokenizer for CJK UTF-8 strings in
> >> KinoSearch, but I don't know how.
>
> KinoSearch introduced proper UTF-8 support with version 0.20_01. The
> current stable release won't work -- we definitely need the devel
> branch.
>
> However, the API for subclassing Analyzer is likely to change.
>
> I'm currently working hard on KinoSearch's underlying OO model, and
> this is likely to affect the details of subclassing Analyzer. I'll
> have more to say in a bit, but this is the gist:
>
> * KinoSearch::Util::Class and KinoSearch::Util::Obj will be merged
> into a new public class, KinoSearch::Obj, which will serve as a
> common base and will have documentation on subclassing.
> * _ALL_ KS classes will be re-implemented at the Perl level using
> the inside-out pattern, surrounding a C struct core. In other
> words, they will all look like MockScorer does now.
>
> This is being done to make it easier to send both objects and method
> calls across the Perl-C boundary in KS. I think I'll be done in a
> few days.
>
>
> >> In fact, I have already written one for Plucene:
> >>
> >> http://search.cpan.org/dist/Plucene-Analysis-UTF8/
> >> http://code.google.com/p/plucene-analysis-utf8/
>
> Two notes.
>
> First, "Plucene::Analysis::UTF8" is not a good choice of namespace
> for a CJK Tokenizer. It doesn't advertise CJK, and "UTF8" is hardly
> the exclusive province of CJK languages.
>
> Second, prior to the introduction of Plucene::Analysis::UTF8, there
> was already another module available on CPAN serving the same need:
> Plucene::Analysis::CJKTokenizer. While I certainly understand that
> sometimes you have to start fresh, I'm curious why it was necessary
> in this case.
>
>
> >> The algorithm is very simple. When a string has the UTF-8 flag on, we
> >> can use a regular expression to extract tokens, and then generate a
> >> list of unigrams and bigrams:
> >>
> >>     my $c = '';
> >>     while ($text =~ /([a-z\d]+|\S)/go) {
> >>         next if $1 =~ /\p{P}|\p{Z}/o;
> >>         $tok{$1} = 1;
> >>         $tok{$c . $1} = 1;
> >>         $c = $1;
> >>     }
>
> That code is somewhat hard to grok because it is so terse. Here's a
> key:
>
> \p{P} => \p{Punctuation}
> \p{Z} => \p{Separator}
> $c => $last_unigram
>
> Also, I don't believe the /o modifier does anything if there are
> no variables being interpolated into the pattern.
>
> Lastly, I'm not sure that performing a regex against $1 like this is
> reliable across all versions of Perl.
>
>
> next if $1 =~ /\p{P}|\p{Z}/o;
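One way to sidestep the question about matching against $1 is to copy the capture into a lexical before running the second regex. The sketch below is a standalone illustration of the unigram/bigram algorithm quoted above, not KinoSearch or Plucene code. The function name cjk_ngrams is hypothetical, and it returns an ordered list rather than filling a hash. It also leaves the previous unigram undefined at the start, so the first token does not produce a "bigram" that is just a copy of itself.

```perl
use strict;
use warnings;
use utf8;

# Hypothetical helper (not part of KinoSearch or Plucene): returns the
# bigrams and unigrams produced by the regex-based algorithm quoted above.
sub cjk_ngrams {
    my ($text) = @_;
    my @tokens;
    my $last_unigram;    # undef until the first unigram has been seen

    while ( $text =~ /([a-z\d]+|\S)/g ) {
        # Copy the capture right away so later matches cannot disturb it.
        my $unigram = $1;

        # Skip punctuation and separators, as the original code does.
        next if $unigram =~ /\p{Punctuation}|\p{Separator}/;

        # Bigram first, then unigram, matching the order used in this thread.
        push @tokens, $last_unigram . $unigram if defined $last_unigram;
        push @tokens, $unigram;
        $last_unigram = $unigram;
    }
    return @tokens;
}

binmode STDOUT, ':encoding(UTF-8)';
print join( ' ', cjk_ngrams("ab 中文。") ), "\n";    # prints: ab ab中 中 中文 文
```

As in the original, skipped punctuation does not reset the previous unigram, so a bigram can span a punctuation mark. Whether that is desirable is a design choice this sketch leaves open.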
>
> Translating that algo into the current Analyzer idiom might look
> something like this:
>
>     sub analyze_batch {
>         my ( $self, $batch ) = @_;
>         my $new_batch = KinoSearch::Analysis::TokenBatch->new;
>         my $last_unigram;    # undef until the first unigram is seen
>         my $last_start_offset;
>
>         while ( my $token = $batch->next ) {
>             for ( $token->get_text ) {
>
>                 while (/([a-z\d]+|\S)/g) {
>                     my $start_offset = $-[0];
>                     my $end_offset   = $+[0];
>
>                     next if $1 =~ /\p{Punctuation}|\p{Separator}/; # (Is this safe?)
>
>                     # Bigram token.
>                     if ( defined $last_unigram ) {
>                         my $bigram = $last_unigram . $1;
>                         my $new_bigram_token = KinoSearch::Analysis::Token->new(
>                             text         => $bigram,
>                             start_offset => $last_start_offset,
>                             end_offset   => $end_offset,
>                             pos_inc      => 0,
>                         );
>                         $new_batch->append($new_bigram_token);
>                     }
>
>                     # Unigram token.
>                     my $new_unigram_token = KinoSearch::Analysis::Token->new(
>                         text         => $1,
>                         start_offset => $start_offset,
>                         end_offset   => $end_offset,
>                     );
>                     $new_batch->append($new_unigram_token);
>
>                     # Seed for next loop iteration.
>                     $last_unigram      = $1;
>                     $last_start_offset = $start_offset;
>                 }
>             }
>         }
>
>         return $new_batch;
>     }
>
> However, that's going to be horribly slow. The same algorithm can be
> implemented MUCH faster using XS. Tokenizing is the biggest index-
> time bottleneck, so it's important to be both correct and efficient.
>
> I tried to think of a way to implement this algo using the current
> Tokenizer code in conjunction with regex lookahead/lookbehind, but I
> couldn't come up with anything.
>
> Maybe we could put together something like
> KSx::Analysis::NGramTokenizer?
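As a rough illustration of what the core of a generic n-gram tokenizer might do, here is a pure-Perl sketch. The function name ngrams and its interface are hypothetical; KSx::Analysis::NGramTokenizer is only proposed in this thread, and nothing below reflects an actual KinoSearch API. It takes a maximum n-gram length and a list of already-extracted units (single CJK characters, ASCII words, and so on) and returns every run of 1 to max_n adjacent units.

```perl
use strict;
use warnings;

# Hypothetical helper: given a list of units, return every run of
# 1 .. $max_n adjacent units, joined into a single string.
sub ngrams {
    my ( $max_n, @units ) = @_;
    my @grams;
    for my $end ( 0 .. $#units ) {
        for my $n ( 1 .. $max_n ) {
            my $start = $end - $n + 1;
            last if $start < 0;    # not enough preceding units yet
            push @grams, join( '', @units[ $start .. $end ] );
        }
    }
    return @grams;
}

print join( ' ', ngrams( 2, qw(a b c) ) ), "\n";    # prints: a b ab c bc
```

With max_n set to 2 this yields the same unigram-plus-bigram set as the algorithm discussed above, although it emits each unigram before the bigram that ends on it. A real tokenizer would also need to carry offsets and position increments.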
>
> Marvin Humphrey
> Rectangular Research
> http://www.rectangular.com/
>