[KinoSearch] TokenBatch API (was Content indexing question)

Peter Karman peter at peknet.com
Sat Nov 4 13:30:54 PST 2006



Marvin Humphrey scribbled on 11/3/06 12:12 PM:

> Thanks for stepping in with the support help and freeing me up to spend 
> time on the rest of this post.
> 

like Robert De Niro says in that great scene in the film _Brazil_: we're all in 
this together. ;)


>> aside to Marvin: I hope TokenBatch doesn't go away altogether; Swish3 
>> assumes there will be a quick way to add a word list in a tight loop.
> 
> TokenBatch will not go away.  It's marked that way because the API isn't 
> what I would like it to be.

cool.

> 
> Here are the changes planned for Token and TokenBatch in 0.20:
> 
>  * Token will get a public API and its own constructor.
>  * All of the accessor methods for manipulating the current token
>    will be removed from TokenBatch and given to Token instead.
>  * TokenBatch's append() method will be changed to take a Token
>    object.  It will no longer serve as a factory method; you
>    must use the Token constructor to create an object which you
>    then add to the batch via $batch->append($token);
>  * A fetch_next() method which returns a Token will be added
>    to TokenBatch.

nice.


> This is better, IMO:
> 
>   while ( my $token = $batch->fetch_next ) {
>        $token->set_text( lc( $token->get_text ) );
>   }
> 

agreed.

> Here's how the change in behavior for append() will affect things:
> 
>   # current:
>   $batch->append( $text, $start, $end );
> 
>   # new:
>   my $token = KinoSearch::Analysis::Token->new( $text, $start, $end );
>   $batch->append($token);
> 
> That's more verbose, but conceptually cleaner.  (And power users will 
> want to use something other than append() anyway -- read on.)
> 

agreed, on both counts.

> There's another method I've considered adding: insert(), which would 
> splice a new Token into the batch either before or after the current 
> one.  However, I think that adding insert() is unnecessary, since you 
> can effectively transform a TokenBatch by creating a new one and moving 
> tokens from the old one, processing en route.  Here's how a 
> SynonymAnalyzer could be implemented:
> 
>   while ( my $token = $batch->next ) {
>       # copy the current token over
>       $new_batch->append($token);
> 
>       # add a new token with a pos_inc of 0 for each synonym
>       my $synonyms = get_synonyms( $token->get_text );
>       for (@$synonyms) {
>           my $synonym_token = $token->clone;
>           $synonym_token->set_text($_);
>           $synonym_token->set_pos_inc(0);
>           $new_batch->append($synonym_token);
>       }
>   }
> 
> I think that's clearer, because append() doesn't have the ambiguity that 
> insert() would with regards to just where the Token gets inserted.  So, 
> I think insert() will not see the light of day.
> 

yes, I like a single method, especially if you can set position explicitly. Is 
that what I'm seeing in set_pos_inc() ?

Xapian (for example), has this API:

  $doc->add_posting( $text, $pos, $weight )

which I like, since then I can artificially inflate the frequency of a word, or 
as in your synonym example, put similar words at the same position to increase 
hits for phrases, etc.


> Additionally, for the benefit of über-hackers such as yourself, I would 
> like to expose an API for adding many tokens at once.  That's important 
> for maximizing performance.  I know you've wrestled with this issue so 
> I'll be curious what you have to say.
> 
> TokenBatch presently has a private add_many_tokens() method, which takes 
> a string, an array of starts, and an array of ends.  The starts/ends are 
> bytecounts, rather than Unicode code points.
> 
>   my $string = 'i am the walrus';
>   my @starts = ( 0, 2, 5, 9 );
>   my @ends   = ( 1, 4, 8, 15 );
>   $batch->add_many_tokens( $string, \@starts, \@ends );
> 
> Here's how add_many_tokens would look if it were implemented in Perl:
> 
>   sub add_many_tokens {
>       my ( $batch, $orig_string, $starts, $ends ) = @_;
>       for my $i ( 0 .. $#$starts ) {
>           my $start    = $starts->{$i};
>           my $end      = $ends->{$i};
>           my $len      = $end - $start;
>           my $string   = bytes::substr( $orig_string, $start, $len );
>           my $token
>               = KinoSearch::Analysis::Token->new( $string, $start, $end );
>           $batch->append($token);
>       }
>   }
> 
> The C version of that is fastest way I've found for taking input from a 
> Tokenizer.  Grabbing substrings at the C level based on start and end 
> points is considerably faster than using regex captures from Perl 
> space.  However, that API less intuitive, and less flexible.  So perhaps 
> we need two advanced APIs.
> 
>    # new method
>    $batch->gen_tokens( \@strings, \@starts \@ends );
> 
>    # add_many_tokens, renamed
>    $batch->gen_tokens_substr( $string, \@starts, \@ends );
> 
> Thoughts?
> 

yes, this is a thorny issue. I'm not sure what would be the most useful to the 
most people. Here's my case, though, as one example.

Swish3's parser extracts all the text and context (i.e., which tags a text is 
wrapped in) from a XML or HTML doc, and creates a WordList object, which is just 
a simple C double-linked list. Each node in the WordList is a Word object (i.e. 
Token):

  struct swishp_Word
  {
             unsigned int           position;
             xmlChar *              metaname; /* context */
             xmlChar *              word;     /* token */
             unsigned int           start_offset;
             unsigned int           end_offset;
             struct swishp_Word *   next;
             struct swishp_Word *   prev;
  };

So I need a way to iterate over the WordList and add each Word/Token to the 
index. The start and end offsets are bytes, not code points (though as I look at 
it now, I probably need to make them longs or even bigger to account for really 
large files). And yes, I'd like to do it at the C level since it's such a tight 
loop and I don't need the overhead of Perl methods.

I was planning to get into the guts of TokenBatch and figure out a way to do 
that myself; if there's a way to make a solution part of the public API, so much 
the better.

Make sense?

I also have the additional challenge of figuring out how best to apply the same 
tokenizing algorithm (analyzer) used to create the swishp_Word object to parsing 
queries. Haven't had time to mull that one over yet.


-- 
Peter Karman  .  http://peknet.com/  .  peter at peknet.com



More information about the kinosearch mailing list