[KinoSearch] Write a custom analyzer/tokenizer

☼ 林永忠 ☼ (Yung-chung Lin) henearkrxern at gmail.com
Mon Mar 10 21:42:39 PDT 2008



Hi,

I have written a CJK tokenizer in C++. You can find it here:
http://code.google.com/p/cjk-tokenizer

Hope it will be useful to people who need to index CJK texts.

Best,
Yung-chung Lin

On Tue, Nov 20, 2007 at 2:21 AM, Marvin Humphrey <marvin at rectangular.com> wrote:
>
>  On Nov 17, 2007, at 6:32 PM, Peter Karman wrote:
>
>  > This has been sitting in my inbox because I never saw a reply.
>  >
>  > Marvin, any tips?
>  >
>  > Gea-Suan Lin wrote on 7/28/07 5:25 AM:
>  >> Hello all,
>  >>
>  >> I want to write a custom analyzer/tokenizer for CJK UTF-8 strings in
>  >> KinoSearch, but I don't know how.
>
>  KinoSearch introduced proper UTF-8 support with version 0.20_01.  The
>  current stable release won't work -- we definitely need the devel
>  branch.
>
>  However, the API for subclassing Analyzer is likely to change.
>
>  I'm currently working hard on KinoSearch's underlying OO model, and
>  this is likely to affect the details of subclassing Analyzer.  I'll
>  have more to say in a bit, but this is the gist:
>
>     * KinoSearch::Util::Class and KinoSearch::Util::Obj will be merged
>       into a new public class, KinoSearch::Obj, which will serve as a
>       common base and will have documentation on subclassing.
>     * _ALL_ KS classes will be re-implemented at the Perl level using
>       the inside-out pattern, surrounding a C struct core.  In other
>       words, they will all look like MockScorer does now.
>
>  This is being done to make it easier to send both objects and method
>  calls across the Perl-C boundary in KS.  I think I'll be done in a
>  few days.
>
>
>  >> In fact, I have already written one for Plucene:
>  >>
>  >> http://search.cpan.org/dist/Plucene-Analysis-UTF8/
>  >> http://code.google.com/p/plucene-analysis-utf8/
>
>  Two notes.
>
>  First, "Plucene::Analysis::UTF8" is not a good choice of namespace
>  for a CJK Tokenizer.  It doesn't advertise CJK, and "UTF8" is hardly
>  the exclusive province of CJK languages.
>
>  Second, prior to the introduction of Plucene::Analysis::UTF8, there
>  was already another module available on CPAN serving the same need:
>  Plucene::Analysis::CJKTokenizer.  While I certainly understand that
>  sometimes you have to start fresh, I'm curious why it was necessary
>  in this case.
>
>
>  >> The algorithm is very simple. When a string has the UTF-8 flag on,
>  >> we can use a regular expression to extract characters from it, and
>  >> then generate a list of unigrams and bigrams:
>  >>
>  >>     my $c = '';
>  >>     while ($text =~ /([a-z\d]+|\S)/go) {
>  >>      next if $1 =~ /\p{P}|\p{Z}/o;
>  >>      $tok{$1} = 1;
>  >>      $tok{$c . $1} = 1;
>  >>      $c = $1;
>  >>     }
>
>  That code is somewhat hard to grok because it is so terse.  Here's a
>  key:
>
>    \p{P} => \p{Punctuation}
>    \p{Z} => \p{Separator}
>    $c    => $last_unigram
>
>  Also, I don't believe the /o modifier does anything if there are
>  no variables being interpolated into the pattern.
>
>  Lastly, I'm not sure that performing a regex against $1 like this is
>  reliable across all versions of Perl.
>
>
>         next if $1 =~ /\p{P}|\p{Z}/o;
>
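For anyone who wants to try the quoted unigram/bigram loop outside of Plucene or KinoSearch, here is a minimal standalone sketch using the longer names from the key above. The helper name unigrams_and_bigrams is hypothetical, and the match is copied into a lexical before the second regex runs, which sidesteps the question about matching against $1.

```perl
use strict;
use warnings;
use utf8;    # this source file contains literal CJK characters

# Hypothetical helper for illustration; not part of KinoSearch or Plucene.
sub unigrams_and_bigrams {
    my ($text) = @_;
    my %tokens;
    my $last_unigram = '';

    while ( $text =~ /([a-z\d]+|\S)/g ) {
        my $unigram = $1;    # copy before running another match
        next if $unigram =~ /\p{Punctuation}|\p{Separator}/;
        $tokens{$unigram} = 1;                      # unigram
        $tokens{ $last_unigram . $unigram } = 1;    # bigram with previous unit
        $last_unigram = $unigram;
    }
    return sort keys %tokens;
}

binmode STDOUT, ':encoding(UTF-8)';
print join( ' ', unigrams_and_bigrams('東京 perl5') ), "\n";
```

As in the quoted snippet, a bigram is also formed across whitespace, so the input above yields 京perl5 as well as 東京. Whether that is desirable depends on the application.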
>  Translating that algo into the current Analyzer idiom might look
>  something like this:
>
>    sub analyze_batch {
>      my ( $self, $batch ) = @_;
>      my $new_batch = KinoSearch::Analysis::TokenBatch->new;
>      my $last_unigram;    # undef until the first unigram has been seen
>      my $last_start_offset;
>
>      while ( my $token = $batch->next ) {
>        for ( $token->get_text ) {
>
>          while (/([a-z\d]+|\S)/g) {
>            my $start_offset = $-[0];
>            my $end_offset   = $+[0];
>
>            next if $1 =~ /\p{Punctuation}|\p{Separator}/; # (Is this safe?)
>
>            # Bigram token.
>            if ( defined $last_unigram ) {
>              my $new_bigram_token = KinoSearch::Analysis::Token->new(
>                text         => "$last_unigram$1",
>                start_offset => $last_start_offset,
>                end_offset   => $end_offset,
>                pos_inc      => 0,
>              );
>              $new_batch->append($new_bigram_token);
>            }
>
>            # Unigram token.
>            my $new_unigram_token = KinoSearch::Analysis::Token->new(
>              text         => $1,
>              start_offset => $start_offset,
>              end_offset   => $end_offset,
>            );
>            $new_batch->append($new_unigram_token);
>
>            # Seed for next loop iteration.
>            $last_unigram      = $1;
>            $last_start_offset = $start_offset;
>          }
>        }
>      }
>
>      return $new_batch;
>    }
>
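To see which offsets the loop above would produce without installing the devel branch, here is a rough sketch that keeps the same logic but returns plain hash refs. The function name bigram_tokens and the use of hash refs in place of KinoSearch::Analysis::Token objects are assumptions made for illustration; only the field names are taken from the quoted code.

```perl
use strict;
use warnings;
use utf8;    # this source file contains literal CJK characters

# Hypothetical stand-in for analyze_batch: returns plain hash refs
# instead of KinoSearch::Analysis::Token objects.
sub bigram_tokens {
    my ($text) = @_;
    my @tokens;
    my ( $last_unigram, $last_start_offset );

    while ( $text =~ /([a-z\d]+|\S)/g ) {
        # Copy the match data before any further regex runs.
        my ( $unigram, $start_offset, $end_offset ) = ( $1, $-[0], $+[0] );
        next if $unigram =~ /\p{Punctuation}|\p{Separator}/;

        # Bigram: from the previous unigram's start to this unigram's end.
        if ( defined $last_unigram ) {
            push @tokens, {
                text         => $last_unigram . $unigram,
                start_offset => $last_start_offset,
                end_offset   => $end_offset,
                pos_inc      => 0,
            };
        }

        # Unigram.
        push @tokens, {
            text         => $unigram,
            start_offset => $start_offset,
            end_offset   => $end_offset,
        };

        # Seed for the next loop iteration.
        ( $last_unigram, $last_start_offset ) = ( $unigram, $start_offset );
    }
    return \@tokens;
}

binmode STDOUT, ':encoding(UTF-8)';
printf "%s [%d,%d)\n", @{$_}{qw(text start_offset end_offset)}
    for @{ bigram_tokens('東京都') };
```

The offsets from @- and @+ count characters only when the input is a decoded character string. On raw UTF-8 bytes they count bytes, and \S would match one byte at a time.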
>  However, that's going to be horribly slow.  The same algorithm can be
>  implemented MUCH faster using XS.  Tokenizing is the biggest index-
>  time bottleneck, so it's important to be both correct and efficient.
>
>  I tried to think of a way to implement this algo using the current
>  Tokenizer code in conjunction with regex lookahead/lookbehind, but I
>  couldn't come up with anything.
>
>  Maybe we could put together something like
>  KSx::Analysis::NGramTokenizer?
>
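As a rough idea of what the core of an n-gram tokenizer could look like in pure Perl, here is a sketch that generalizes the unigram/bigram loop to any maximum n. KSx::Analysis::NGramTokenizer is only a proposal in this thread, so the function name ngrams and its parameters are hypothetical and do not reflect any real KinoSearch API.

```perl
use strict;
use warnings;
use utf8;    # this source file contains literal CJK characters

# Hypothetical helper; nothing here reflects a real KinoSearch API.
sub ngrams {
    my ( $text, $max_n ) = @_;

    # Split into units: runs of a-z and digits, or single non-space
    # characters, dropping punctuation and separators.
    my @units = grep { !/\p{Punctuation}|\p{Separator}/ }
        $text =~ /([a-z\d]+|\S)/g;

    my @grams;
    for my $i ( 0 .. $#units ) {
        for my $n ( 1 .. $max_n ) {
            last if $i + $n > @units;    # window would run off the end
            push @grams, join '', @units[ $i .. $i + $n - 1 ];
        }
    }
    return @grams;
}

binmode STDOUT, ':encoding(UTF-8)';
print join( ' ', ngrams( '東京都', 2 ) ), "\n";
```

With a maximum n of 2 this emits the same unigrams and bigrams as the loop discussed above, except that no bigram is emitted for the first unit on its own. It does not track offsets and would be subject to the same speed concerns as any pure-Perl tokenizer.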
>  Marvin Humphrey
>  Rectangular Research
>  http://www.rectangular.com/
>
>
>
>
>
>  _______________________________________________
>  KinoSearch mailing list
>  KinoSearch at rectangular.com
>  http://www.rectangular.com/mailman/listinfo/kinosearch
>
