[KinoSearch] Write a custom analyzer/tokenizer

Marvin Humphrey marvin at rectangular.com
Mon Nov 19 10:21:20 PST 2007




On Nov 17, 2007, at 6:32 PM, Peter Karman wrote:

> This has been sitting in my inbox because I never saw a reply.
>
> Marvin, any tips?
>
> Gea-Suan Lin wrote on 7/28/07 5:25 AM:
>> Hello all,
>>
>> I want to write a custom analyzer/tokenizer for CJK UTF-8 string in
>> KinoSearch, but I don't know how.

KinoSearch introduced proper UTF-8 support with version 0.20_01.  The  
current stable release won't work -- we definitely need the devel  
branch.

However, the API for subclassing Analyzer is likely to change.

I'm currently working hard on KinoSearch's underlying OO model, and  
this is likely to affect the details of subclassing Analyzer.  I'll  
have more to say in a bit, but this is the gist:

    * KinoSearch::Util::Class and KinoSearch::Util::Obj will be merged
      into a new public class, KinoSearch::Obj, which will serve as a
      common base and will have documentation on subclassing.
    * _ALL_ KS classes will be re-implemented at the Perl level using
      the inside-out pattern, surrounding a C struct core.  In other
      words, they will all look like MockScorer does now.

This is being done to make it easier to send both objects and method  
calls across the Perl-C boundary in KS.  I think I'll be done in a  
few days.

>> In fact I already write one for Plucene:
>>
>> http://search.cpan.org/dist/Plucene-Analysis-UTF8/
>> http://code.google.com/p/plucene-analysis-utf8/

Two notes.

First, "Plucene::Analysis::UTF8" is not a good choice of namespace  
for a CJK Tokenizer.  It doesn't advertise CJK, and "UTF8" is hardly  
the exclusive province of CJK languages.

Second, prior to the introduction of Plucene::Analysis::UTF8, there  
was already another module available on CPAN serving the same need:  
Plucene::Analysis::CJKTokenizer.  While I certainly understand that  
sometimes you have to start fresh, I'm curious why it was necessary  
in this case.

>> The algorithm is very simple. When a string with UTF-8 flag on, we can
>> use regular expression to extract it, and then generate unigram and
>> bigram list:
>>
>>     my $c = '';
>>     while ($text =~ /([a-z\d]+|\S)/go) {
>> 	next if $1 =~ /\p{P}|\p{Z}/o;
>> 	$tok{$1} = 1;
>> 	$tok{$c . $1} = 1;
>> 	$c = $1;
>>     }

That code is somewhat hard to grok because it is so terse.  Here's a  
key:

   \p{P} => \p{Punctuation}
   \p{Z} => \p{Separator}
   $c    => $last_unigram

Also, I don't believe that the /o modifier does anything if there are  
no variables being interpolated into the pattern.

Lastly, I'm not sure that performing a regex against $1 like this is  
reliable across all versions of Perl.

	next if $1 =~ /\p{P}|\p{Z}/o;
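For reference, here is a standalone sketch of the quoted loop with the names from the key spelled out, the /o modifier dropped, and $1 copied into a lexical before the second regex runs. The function name extract_ngrams and the hash-ref return value are assumptions made for illustration; they are not part of Plucene or KinoSearch.

```perl
use strict;
use warnings;

# Hypothetical helper: returns a hash ref whose keys are the unigrams
# and bigrams found in $text.
sub extract_ngrams {
    my ($text) = @_;
    my %tokens;
    my $last_unigram = '';

    while ( $text =~ /([a-z\d]+|\S)/g ) {
        my $unigram = $1;    # copy $1 before running another regex
        next if $unigram =~ /\p{Punctuation}|\p{Separator}/;

        $tokens{$unigram}                   = 1;
        $tokens{ $last_unigram . $unigram } = 1;
        $last_unigram = $unigram;
    }

    return \%tokens;
}
```

As in the original, the first pass through the loop stores the unigram twice under the same key, because $last_unigram starts out empty.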

Translating that algo into the current Analyzer idiom might look  
something like this:

   sub analyze_batch {
     my ( $self, $batch ) = @_;
     my $new_batch = KinoSearch::Analysis::TokenBatch->new;
     my $last_unigram;         # undef until the first unigram is seen
     my $last_start_offset;

     while ( my $token = $batch->next ) {
       for ( $token->get_text ) {

         while (/([a-z\d]+|\S)/g) {
           my $unigram      = $1;    # copy before the next regex runs
           my $start_offset = $-[0];
           my $end_offset   = $+[0];

           # Skip punctuation and separators.
           next if $unigram =~ /\p{Punctuation}|\p{Separator}/;

           # Bigram token.
           if ( defined $last_unigram ) {
             my $new_bigram_token = KinoSearch::Analysis::Token->new(
               text         => $last_unigram . $unigram,
               start_offset => $last_start_offset,
               end_offset   => $end_offset,
               pos_inc      => 0,
             );
             $new_batch->append($new_bigram_token);
           }

           # Unigram token.
           my $new_unigram_token = KinoSearch::Analysis::Token->new(
             text         => $unigram,
             start_offset => $start_offset,
             end_offset   => $end_offset,
           );
           $new_batch->append($new_unigram_token);

           # Seed for next loop iteration.
           $last_unigram      = $unigram;
           $last_start_offset = $start_offset;
         }
       }
     }

     return $new_batch;
   }

However, that's going to be horribly slow.  The same algorithm can be  
implemented MUCH faster using XS.  Tokenizing is the biggest  
index-time bottleneck, so it's important to be both correct and efficient.

I tried to think of a way to implement this algo using the current  
Tokenizer code in conjunction with regex lookahead/lookbehind, but I  
couldn't come up with anything.

Maybe we could put together something like  
KSx::Analysis::NGramTokenizer?
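
As a rough pure-Perl sketch of the core such a module might wrap: the function below joins every run of $n consecutive units into one n-gram. KSx::Analysis::NGramTokenizer does not exist yet, and the name ngrams and its interface are made up for illustration only. A real implementation would presumably live in XS and emit Token objects with offsets.

```perl
use strict;
use warnings;

# Hypothetical helper: given an array ref of units (e.g. the unigrams
# produced by a tokenizing regex) and a size $n, return the list of
# n-grams formed from each run of $n consecutive units.
sub ngrams {
    my ( $units, $n ) = @_;
    my @grams;

    # The range is empty when there are fewer than $n units.
    for my $i ( 0 .. $#$units - $n + 1 ) {
        push @grams, join '', @{$units}[ $i .. $i + $n - 1 ];
    }

    return @grams;
}
```

Calling ngrams( \@unigrams, 2 ) would give the bigram half of the unigram-plus-bigram scheme discussed above.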

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/


