[KinoSearch] TokenBatch API (was Content indexing question)

Marvin Humphrey marvin at rectangular.com
Fri Nov 3 10:12:44 PST 2006


On Nov 3, 2006, at 6:22 AM, Peter Karman wrote:

> If you need an easy hack that won't be affected by KS API changes,  
> you'd need to do it in your parser. Something like combining with a  
> nonsense word that would still work with the default token_re:
>
>  $doc->set_value('h1' => join(' nosuchwordhere ', @h1_contents));
>  # note the spaces around ' nosuchwordhere '
>
> Unless someone searches for 'nosuchwordhere' they'd never get a  
> match across block tag boundaries.

Yeah, that'd work.  Nice. :)

Thanks for stepping in with the support help and freeing me up to  
spend time on the rest of this post.

> aside to Marvin: I hope TokenBatch doesn't go away altogether;  
> Swish3 assumes there will be a quick way to add a word list in a  
> tight loop.

TokenBatch will not go away.  It's marked that way because the API  
isn't what I would like it to be.

TokenBatch is very important.  KinoSearch does not offer a  
smorgasbord of Tokenizers a la Lucene.  There's only the one
regex-based Tokenizer, and TokenBatch for power users who want to roll
their own Analyzers.  I'm counting on TokenBatch to keep the core a  
manageable size, and that's a policy I hope to propagate to Lucy.  I  
think it's better to focus on making TokenBatch as good as it can be,  
rather than on supporting a slew of specialized Analyzer subclasses.

When TokenBatch's API was created, I hadn't figured out how to expose  
a Token in Perl-space without imposing the requirement that _every_  
token get a Perl object _everywhere_.  That would have slowed down  
indexing substantially for the general case.  Fortunately, the  
reference-counting scheme that you and Dave showed me on lucy-dev  
solves that problem.  So now I am able to update TokenBatch's API to  
what it should have been in the first place.

Here are the changes planned for Token and TokenBatch in 0.20:

  * Token will get a public API and its own constructor.
  * All of the accessor methods for manipulating the current token
    will be removed from TokenBatch and given to Token instead.
  * TokenBatch's append() method will be changed to take a Token
    object.  It will no longer serve as a factory method; you
    must use the Token constructor to create an object which you
    then add to the batch via $batch->append($token);
  * A fetch_next() method which returns a Token will be added
    to TokenBatch.

Currently, if you want to iterate over the tokens in an existing  
TokenBatch and transform them somehow, this is what you do:

   while ( $batch->next ) {
        $batch->set_text( lc( $batch->get_text ) );
   }

This is better, IMO:

   while ( my $token = $batch->fetch_next ) {
        $token->set_text( lc( $token->get_text ) );
   }

Here's how the change in behavior for append() will affect things:

   # current:
   $batch->append( $text, $start, $end );

   # new:
   my $token = KinoSearch::Analysis::Token->new( $text, $start, $end );
   $batch->append($token);

That's more verbose, but conceptually cleaner.  (And power users will  
want to use something other than append() anyway -- read on.)
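
To make the planned shape concrete, here is a minimal pure-Perl sketch of the proposed calling convention.  The Sketch::Token and Sketch::TokenBatch packages are hypothetical stand-ins written only for this illustration; they are not the KinoSearch classes, and the real 0.20 API may differ in detail.

```perl
use strict;
use warnings;

# Hypothetical stand-ins for illustration; NOT the real KinoSearch classes.
package Sketch::Token;

sub new {
    my ( $class, $text, $start, $end ) = @_;
    return bless { text => $text, start => $start, end => $end }, $class;
}
sub get_text { return $_[0]{text} }
sub set_text { $_[0]{text} = $_[1]; return }

package Sketch::TokenBatch;

sub new { my $class = shift; return bless { tokens => [], tick => 0 }, $class }

# append() takes a ready-made Token instead of acting as a factory.
sub append { my ( $self, $token ) = @_; push @{ $self->{tokens} }, $token; return }

# fetch_next() hands back the next Token, or nothing once exhausted.
sub fetch_next {
    my $self = shift;
    return if $self->{tick} >= @{ $self->{tokens} };
    return $self->{tokens}[ $self->{tick}++ ];
}

package main;

my $batch = Sketch::TokenBatch->new;
$batch->append( Sketch::Token->new( 'The',    0, 3 ) );
$batch->append( Sketch::Token->new( 'Walrus', 4, 10 ) );

# Transform each token through the Token's own accessors.
while ( my $token = $batch->fetch_next ) {
    $token->set_text( lc( $token->get_text ) );
}

my @texts = map { $_->get_text } @{ $batch->{tokens} };
print "@texts\n";
```

The point of the sketch is only that all per-token state lives on the Token, so the batch needs nothing beyond append() and an iterator.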

There's another method I've considered adding: insert(), which would  
splice a new Token into the batch either before or after the current  
one.  However, I think that adding insert() is unnecessary, since you  
can effectively transform a TokenBatch by creating a new one and  
moving tokens from the old one, processing en route.  Here's how a  
SynonymAnalyzer could be implemented:

   while ( my $token = $batch->fetch_next ) {
       # copy the current token over
       $new_batch->append($token);

       # add a new token with a pos_inc of 0 for each synonym
       my $synonyms = get_synonyms( $token->get_text );
       for (@$synonyms) {
           my $synonym_token = $token->clone;
           $synonym_token->set_text($_);
           $synonym_token->set_pos_inc(0);
           $new_batch->append($synonym_token);
       }
   }

I think that's clearer, because append() doesn't have the ambiguity  
that insert() would with regard to just where the Token gets  
inserted.  So, I think insert() will not see the light of day.
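
The copy-and-append approach can be tried out without KinoSearch at all.  The sketch below models tokens as plain hashes and batches as plain arrays, uses a made-up synonym table, and assumes that pos_inc behaves like a Lucene-style position increment (1 advances to the next position, 0 stacks the token on the previous one).  All names in it are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical synonym table, for illustration only.
my %SYNONYMS = ( quick => [ 'fast', 'speedy' ] );

# Plain hashes stand in for Token objects; arrays stand in for batches.
my @old_batch = (
    { text => 'quick', start => 0, end => 5, pos_inc => 1 },
    { text => 'fox',   start => 6, end => 9, pos_inc => 1 },
);

my @new_batch;
for my $token (@old_batch) {
    # copy the current token over
    push @new_batch, $token;

    # add a clone with a pos_inc of 0 for each synonym
    for my $synonym ( @{ $SYNONYMS{ $token->{text} } || [] } ) {
        push @new_batch, { %$token, text => $synonym, pos_inc => 0 };
    }
}

# Resolve each token's position by accumulating the increments.
my $pos = -1;
my @positions;
for my $token (@new_batch) {
    $pos += $token->{pos_inc};
    push @positions, "$token->{text}\@$pos";
}
print join( ' ', @positions ), "\n";
```

Under that assumption the synonyms share the position of the token they were cloned from and keep its start and end offsets, while the following token still lands one position later.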

Additionally, for the benefit of über-hackers such as yourself, I  
would like to expose an API for adding many tokens at once.  That's  
important for maximizing performance.  I know you've wrestled with  
this issue so I'll be curious what you have to say.

TokenBatch presently has a private add_many_tokens() method, which
takes a string, an array of starts, and an array of ends.  The starts
and ends are byte offsets, rather than Unicode code point offsets.

   my $string = 'i am the walrus';
   my @starts = ( 0, 2, 5, 9 );
   my @ends   = ( 1, 4, 8, 15 );
   $batch->add_many_tokens( $string, \@starts, \@ends );
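
The byte-versus-code-point distinction only shows up with non-ASCII text.  The following sketch, which uses the core Encode module and a made-up sample string, finds token boundaries as character offsets and converts them to byte offsets into the UTF-8 encoded string.  It assumes, as in the example above, that ends are exclusive; the variable names are illustrative only.

```perl
use strict;
use warnings;
use Encode qw(encode);

my $chars = "\x{FC}ber hacker";            # "u-umlaut" + "ber hacker": 11 characters
my $bytes = encode( 'UTF-8', $chars );     # 12 bytes: the first character takes two

my ( @starts, @ends );
while ( $chars =~ /\S+/g ) {
    # @- and @+ hold the character offsets of the last match.
    my ( $char_start, $char_end ) = ( $-[0], $+[0] );

    # Encoding the prefix up to each offset yields the byte offset.
    push @starts, length( encode( 'UTF-8', substr( $chars, 0, $char_start ) ) );
    push @ends,   length( encode( 'UTF-8', substr( $chars, 0, $char_end ) ) );
}

# Slice the encoded string by byte offsets, the way a substring-based
# API would.
my @tokens = map { substr( $bytes, $starts[$_], $ends[$_] - $starts[$_] ) }
    0 .. $#starts;
```

The first token is four characters long but spans bytes 0 through 5, so every later offset is shifted by one relative to the character offsets.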

Here's how add_many_tokens would look if it were implemented in Perl:

   use bytes ();    # makes bytes::substr available without byte semantics

   sub add_many_tokens {
       my ( $batch, $orig_string, $starts, $ends ) = @_;
       # $starts and $ends are array refs of byte offsets
       for my $i ( 0 .. $#$starts ) {
           my $start  = $starts->[$i];
           my $end    = $ends->[$i];
           my $len    = $end - $start;
           my $string = bytes::substr( $orig_string, $start, $len );
           my $token
               = KinoSearch::Analysis::Token->new( $string, $start, $end );
           $batch->append($token);
       }
   }

The C version of that is the fastest way I've found for taking input from
a Tokenizer.  Grabbing substrings at the C level based on start and
end points is considerably faster than using regex captures from Perl
space.  However, that API is less intuitive and less flexible.  So
perhaps we need two advanced APIs.

    # new method
    $batch->gen_tokens( \@strings, \@starts, \@ends );

    # add_many_tokens, renamed
    $batch->gen_tokens_substr( $string, \@starts, \@ends );
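
As a rough model of how the two proposed entry points would relate, here is a pure-Perl sketch in which the substring variant is a thin wrapper over the list variant.  The function bodies are guesses written for illustration: tokens are plain hashes rather than Token objects, the functions return a list instead of filling a batch, and the input is ASCII so byte and character offsets coincide.

```perl
use strict;
use warnings;

# Hypothetical model: the caller supplies each token's text directly.
sub gen_tokens {
    my ( $strings, $starts, $ends ) = @_;
    die "array lengths differ"
        unless @$strings == @$starts && @$starts == @$ends;
    return [
        map { +{ text => $strings->[$_], start => $starts->[$_], end => $ends->[$_] } }
            0 .. $#$strings
    ];
}

# Hypothetical model: token texts are sliced out of one source string.
sub gen_tokens_substr {
    my ( $string, $starts, $ends ) = @_;
    my @strings
        = map { substr( $string, $starts->[$_], $ends->[$_] - $starts->[$_] ) }
        0 .. $#$starts;
    return gen_tokens( \@strings, $starts, $ends );
}

my @starts = ( 0, 2, 5, 9 );
my @ends   = ( 1, 4, 8, 15 );

my $from_list   = gen_tokens( [qw( i am the walrus )], \@starts, \@ends );
my $from_substr = gen_tokens_substr( 'i am the walrus', \@starts, \@ends );
```

The list form lets an Analyzer emit text that does not appear verbatim in the source (stems, lowercased forms), while the substring form avoids building the intermediate strings on the caller's side.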

Thoughts?

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/




