[KinoSearch] Error: Maximum token length is 65535

Marvin Humphrey marvin at rectangular.com
Mon Jul 14 22:07:48 PDT 2008




On Jul 14, 2008, at 6:18 AM, Riyaad Miller wrote:

> I'm using KS 0.162. When using the following code, the error below  
> is produced:
>
> My Definitions
> my $stemmer    = KinoSearch::Analysis::Stemmer->new( language => 'en' );
> my $stopalizer = KinoSearch::Analysis::Stopalizer->new( language => 'en' );
> my $analyzer   = KinoSearch::Analysis::PolyAnalyzer->new(
>     analyzers => [ $stemmer, $stopalizer ],
> );
>
> The Error
> Maximum token length is 65535; got 107462

You have a PolyAnalyzer which contains a Stemmer and a Stopalizer, but
not a Tokenizer.  Thus, the entire field value, all 107462 characters
of it, is treated as a single token, and that exceeds the 65535
maximum.

Theoretically, if KS had completed indexing successfully rather than
choking on that value, and at search-time someone were to type in the
appropriate 100,000+ character search string, you might get a hit.

Whatever those 107462 characters are, I can guarantee you that nothing
that long exists in the English stop list.  Similarly, I doubt the
Stemmer has anything useful to say about the last few characters of
that field.

You really need a Tokenizer.  You probably also want an LCNormalizer
in there, unless you want searches to be case-sensitive.

   # Lowercases the field text so that matching ignores case.
   my $lc_normalizer = KinoSearch::Analysis::LCNormalizer->new;
   # Splits the field text into individual tokens.
   my $tokenizer     = KinoSearch::Analysis::Tokenizer->new;
   my $stemmer       = KinoSearch::Analysis::Stemmer->new(
      language => 'en',
   );
   my $stopalizer    = KinoSearch::Analysis::Stopalizer->new(
      language => 'en',
   );
   # The analyzers are applied in the order listed.
   my $analyzer      = KinoSearch::Analysis::PolyAnalyzer->new(
      analyzers => [ $lc_normalizer, $tokenizer, $stopalizer, $stemmer ],
   );
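
To see why the Tokenizer matters, here is a rough sketch in plain Perl
that does not use KinoSearch at all.  The sample text, the tiny stop
list, and the token pattern are all assumptions made for illustration;
the pattern is only meant to approximate what a word tokenizer does, and
the stop list is not the real English one.

```perl
use strict;
use warnings;

# Hypothetical stand-in for a field value; not the poster's actual data.
my $field = "The Quick brown fox isn't jumping over the lazy dogs";

# Without a tokenizer, the whole field value is the one and only token.
my @untokenized = ($field);

# Assumed pattern: runs of word characters, allowing internal apostrophes.
my $token_re = qr/\w+(?:'\w+)*/;

# Lowercase first, then split into tokens (list-context global match).
my @tokens = lc($field) =~ /$token_re/g;

# Tiny illustrative stop list; the real English list is much longer.
my %stop = map { $_ => 1 } qw( the over );
my @kept = grep { !$stop{$_} } @tokens;

printf "untokenized: %d token, %d characters long\n",
    scalar @untokenized, length $untokenized[0];
printf "tokenized:   %d tokens, %d left after stop filtering\n",
    scalar @tokens, scalar @kept;
```

With a single giant token, the stop list can never match anything, which
is the situation described above.  Once the text is split, each token is
short enough to be compared against the stop list and stemmed.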

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/