[KinoSearch] Error: Maximum token length is 65535
Marvin Humphrey
marvin at rectangular.com
Mon Jul 14 22:07:48 PDT 2008
On Jul 14, 2008, at 6:18 AM, Riyaad Miller wrote:
> I'm using KS 0.162. When using the following code, the error below
> is produced:
>
> My Definitions
> my $stemmer = KinoSearch::Analysis::Stemmer->new( language =>
> 'en' );
> my $stopalizer = KinoSearch::Analysis::Stopalizer->new(language =>
> 'en');
> my $analyzer = KinoSearch::Analysis::PolyAnalyzer->new(analyzers
> => [$stemmer, $stopalizer]);
>
> The Error
> Maximum token length is 65535; got 107462
You have a PolyAnalyzer which contains a Stemmer and a Stopalizer, but
not a Tokenizer. Thus, the entire field value, all 107462 characters
of it, is the only token.
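
To make that concrete, here is a minimal sketch in plain Perl rather than KinoSearch itself. The field value, the word pattern, and the `longest_token_length` helper are all made up for illustration; only the 65535 limit comes from the error message. Without a tokenizing step there is exactly one token, and it is as long as the field.

```perl
use strict;
use warnings;

# Limit taken from the error message above.
my $MAX_TOKEN_LENGTH = 65535;

# Hypothetical field value: 6000 repetitions of a short phrase.
my $field_value = join ' ', ('The quick brown fox') x 6000;

# Return the length of the longest token in an array ref (0 if empty).
sub longest_token_length {
    my ($tokens) = @_;
    my $longest = 0;
    for my $token (@$tokens) {
        $longest = length($token) if length($token) > $longest;
    }
    return $longest;
}

# No tokenizer: the whole field value is the only token.
my @untokenized = ($field_value);

# Assumed word pattern, for illustration only; not necessarily the
# pattern KinoSearch's Tokenizer uses.
my @tokenized = $field_value =~ /\w+/g;

for my $case ( [ 'no tokenizer', \@untokenized ], [ 'tokenizer', \@tokenized ] ) {
    my ( $label, $tokens ) = @$case;
    my $longest = longest_token_length($tokens);
    printf "%-13s tokens=%d longest=%d %s\n",
        $label, scalar @$tokens, $longest,
        $longest > $MAX_TOKEN_LENGTH ? 'over the limit' : 'ok';
}
```

Splitting on words turns one oversized token into thousands of short ones, each well under the limit.
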
Theoretically, if KS had completed indexing successfully rather than
choked on that value, and at search-time someone were to type in the
appropriate 100,000+ character search string, you might get a hit.
Whatever those 107462 characters are, I can guarantee you that nothing
that long exists in the English stop list. Similarly, I doubt the
Stemmer has anything useful to say about the last few characters of
that field.
You really need a Tokenizer. You probably also want an LCNormalizer
in there unless you really want searches to be case sensitive.

    my $lc_normalizer = KinoSearch::Analysis::LCNormalizer->new;
    my $tokenizer     = KinoSearch::Analysis::Tokenizer->new;
    my $stemmer       = KinoSearch::Analysis::Stemmer->new(
        language => 'en',
    );
    my $stopalizer = KinoSearch::Analysis::Stopalizer->new(
        language => 'en',
    );
    my $analyzer = KinoSearch::Analysis::PolyAnalyzer->new(
        analyzers => [ $lc_normalizer, $tokenizer, $stopalizer, $stemmer ],
    );

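
The order of the chain matters, which a rough sketch in plain Perl can show. This is not KinoSearch's implementation: the stop list, the word pattern, and the `analyze` helper are hypothetical, and stemming is left out. It assumes stop-list lookup is an exact match on each token, so lowercasing has to come first or "The" would slip past a list that contains "the".

```perl
use strict;
use warnings;

# Tiny hypothetical stop list; a real English list is much longer.
my %stop = map { $_ => 1 } qw( the a of and );

# Mimic the chain order: lowercase, tokenize, then drop stop words.
sub analyze {
    my ($text) = @_;
    my @tokens = lc($text) =~ /\w+/g;    # assumed word pattern
    return grep { !$stop{$_} } @tokens;
}

my @terms = analyze('The Origin of Species');
print join( ' ', @terms ), "\n";    # prints "origin species"

# The unsplit field value is never an exact match for a stop word.
print exists $stop{'The Origin of Species'} ? "stopped\n" : "kept\n";    # prints "kept"
```
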
Marvin Humphrey
Rectangular Research
http://www.rectangular.com/