[KinoSearch] TokenBatch API (was Content indexing question)
Marvin Humphrey
marvin at rectangular.com
Fri Nov 3 10:12:44 PST 2006
On Nov 3, 2006, at 6:22 AM, Peter Karman wrote:
> If you need an easy hack that won't be affected by KS API changes,
> you'd need to do it in your parser. Something like combining with a
> nonsense word that would still work with the default token_re:
>
> $doc->set_value('h1' => join(' nosuchwordhere ', @h1_contents));
> # note the spaces around ' nosuchwordhere '
>
> Unless someone searches for 'nosuchwordhere' they'd never get a
> match across block tag boundaries.
Yeah, that'd work. Nice. :)
Thanks for stepping in with the support help and freeing me up to
spend time on the rest of this post.
> aside to Marvin: I hope TokenBatch doesn't go away altogether;
> Swish3 assumes there will be a quick way to add a word list in a
> tight loop.
TokenBatch will not go away. It's marked that way because the API
isn't what I would like it to be.
TokenBatch is very important. KinoSearch does not offer a
smorgasbord of Tokenizers a la Lucene. There's only the one
regex-based Tokenizer, and TokenBatch for power users who want to roll
their own Analyzers. I'm counting on TokenBatch to keep the core a
manageable size, and that's a policy I hope to propagate to Lucy. I
think it's better to focus on making TokenBatch as good as it can be,
rather than on supporting a slew of specialized Analyzer subclasses.
When TokenBatch's API was created, I hadn't figured out how to expose
a Token in Perl-space without imposing the requirement that _every_
token get a Perl object _everywhere_. That would have slowed down
indexing substantially for the general case. Fortunately, the
reference-counting scheme that you and Dave showed me on lucy-dev
solves that problem. So now I am able to update TokenBatch's API to
what it should have been in the first place.
Here are the changes planned for Token and TokenBatch in 0.20:
* Token will get a public API and its own constructor.
* All of the accessor methods for manipulating the current token
  will be removed from TokenBatch and given to Token instead.
* TokenBatch's append() method will be changed to take a Token
  object. It will no longer serve as a factory method; you
  must use the Token constructor to create an object which you
  then add to the batch via $batch->append($token);
* A fetch_next() method which returns a Token will be added
  to TokenBatch.
Currently, if you want to iterate over the tokens in an existing
TokenBatch and transform them somehow, this is what you do:
    while ( $batch->next ) {
        $batch->set_text( lc( $batch->get_text ) );
    }
This is better, IMO:
    while ( my $token = $batch->fetch_next ) {
        $token->set_text( lc( $token->get_text ) );
    }
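For anyone who wants to try that loop before 0.20 exists, here is a minimal pure-Perl sketch. Sketch::Token and Sketch::TokenBatch are hypothetical stand-ins written for this example, not the KinoSearch classes; they only assume that fetch_next() returns a Token object and nothing once the batch is exhausted.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the proposed Token class (not KinoSearch code).
package Sketch::Token;

sub new {
    my ( $class, $text, $start, $end ) = @_;
    return bless { text => $text, start => $start, end => $end }, $class;
}
sub get_text { return $_[0]{text} }
sub set_text { $_[0]{text} = $_[1]; return }

# Hypothetical stand-in for TokenBatch with the proposed append/fetch_next.
package Sketch::TokenBatch;

sub new {
    my $class = shift;
    return bless { tokens => [], tick => 0 }, $class;
}

sub append {
    my ( $self, $token ) = @_;
    push @{ $self->{tokens} }, $token;
    return;
}

# Return the next Token, or nothing once the batch is exhausted.
sub fetch_next {
    my $self = shift;
    return if $self->{tick} >= @{ $self->{tokens} };
    return $self->{tokens}[ $self->{tick}++ ];
}

package main;

my $batch = Sketch::TokenBatch->new;
$batch->append( Sketch::Token->new( 'I',      0, 1 ) );
$batch->append( Sketch::Token->new( 'Am',     2, 4 ) );
$batch->append( Sketch::Token->new( 'The',    5, 8 ) );
$batch->append( Sketch::Token->new( 'Walrus', 9, 15 ) );

# The proposed iteration style: each pass hands back a Token object.
while ( my $token = $batch->fetch_next ) {
    $token->set_text( lc( $token->get_text ) );
}

# Peek at the stand-in's internals to inspect the result.
my @lowered = map { $_->get_text } @{ $batch->{tokens} };
```

Because a Token is a blessed reference, it is always true in boolean context, so the loop ends only when fetch_next() runs out of tokens.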
Here's how the change in behavior for append() will affect things:
    # current:
    $batch->append( $text, $start, $end );

    # new:
    my $token = KinoSearch::Analysis::Token->new( $text, $start, $end );
    $batch->append($token);
That's more verbose, but conceptually cleaner. (And power users will
want to use something other than append() anyway -- read on.)
There's another method I've considered adding: insert(), which would
splice a new Token into the batch either before or after the current
one. However, I think that adding insert() is unnecessary, since you
can effectively transform a TokenBatch by creating a new one and
moving tokens from the old one, processing en route. Here's how a
SynonymAnalyzer could be implemented:
    while ( my $token = $batch->fetch_next ) {
        # copy the current token over
        $new_batch->append($token);

        # add a new token with a pos_inc of 0 for each synonym
        my $synonyms = get_synonyms( $token->get_text );
        for (@$synonyms) {
            my $synonym_token = $token->clone;
            $synonym_token->set_text($_);
            $synonym_token->set_pos_inc(0);
            $new_batch->append($synonym_token);
        }
    }
I think that's clearer, because append() doesn't have the ambiguity
that insert() would with regard to just where the Token gets
inserted. So, I think insert() will not see the light of day.
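The copy-and-expand approach can be exercised with a small self-contained sketch. Everything here is an assumption made for illustration: Sketch::Token and Sketch::TokenBatch are hypothetical stand-ins for the proposed classes, get_synonyms() is a toy lookup table, and a default pos_inc of 1 is assumed for ordinary tokens.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the proposed Token class (not KinoSearch code).
package Sketch::Token;

sub new {
    my ( $class, $text, $start, $end ) = @_;
    my %self = ( text => $text, start => $start, end => $end, pos_inc => 1 );
    return bless \%self, $class;
}

# Shallow copy: same offsets and pos_inc, independent object.
sub clone {
    my $self = shift;
    my %copy = %$self;
    return bless \%copy, ref $self;
}
sub get_text    { return $_[0]{text} }
sub set_text    { $_[0]{text} = $_[1]; return }
sub get_pos_inc { return $_[0]{pos_inc} }
sub set_pos_inc { $_[0]{pos_inc} = $_[1]; return }

# Hypothetical stand-in for TokenBatch.
package Sketch::TokenBatch;

sub new {
    my $class = shift;
    return bless { tokens => [], tick => 0 }, $class;
}

sub append {
    my ( $self, $token ) = @_;
    push @{ $self->{tokens} }, $token;
    return;
}

sub fetch_next {
    my $self = shift;
    return if $self->{tick} >= @{ $self->{tokens} };
    return $self->{tokens}[ $self->{tick}++ ];
}

package main;

# Toy synonym table standing in for a real thesaurus lookup.
my %SYNONYMS = ( quick => [ 'fast', 'speedy' ] );
sub get_synonyms { return $SYNONYMS{ $_[0] } || [] }

my $batch = Sketch::TokenBatch->new;
$batch->append( Sketch::Token->new( 'quick', 0, 5 ) );
$batch->append( Sketch::Token->new( 'fox',   6, 9 ) );

my $new_batch = Sketch::TokenBatch->new;
while ( my $token = $batch->fetch_next ) {
    # copy the current token over
    $new_batch->append($token);

    # each synonym shares the original's offsets but advances no position
    for my $synonym ( @{ get_synonyms( $token->get_text ) } ) {
        my $synonym_token = $token->clone;
        $synonym_token->set_text($synonym);
        $synonym_token->set_pos_inc(0);
        $new_batch->append($synonym_token);
    }
}

# text:pos_inc pairs, in batch order
my @summary
    = map { $_->get_text . ':' . $_->get_pos_inc } @{ $new_batch->{tokens} };
```

The synonyms land immediately after the token they expand, which is the ordering that append() makes unambiguous.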
Additionally, for the benefit of über-hackers such as yourself, I
would like to expose an API for adding many tokens at once. That's
important for maximizing performance. I know you've wrestled with
this issue so I'll be curious what you have to say.
TokenBatch presently has a private add_many_tokens() method, which
takes a string, an array of starts, and an array of ends. The starts
and ends are byte offsets, rather than Unicode code point offsets.
    my $string = 'i am the walrus';
    my @starts = ( 0, 2, 5, 9 );
    my @ends   = ( 1, 4, 8, 15 );
    $batch->add_many_tokens( $string, \@starts, \@ends );
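In the all-ASCII example above, byte offsets and code point offsets happen to be identical. The difference only shows up with multi-byte characters. The following sketch is one way a caller might derive byte offsets from character-level regex matches; it uses only core Perl and the Encode module, and it makes no claim about how KinoSearch's own Tokenizer computes them.

```perl
use strict;
use warnings;
use Encode qw(encode_utf8);

# "über walrus": the u-umlaut is one code point but two bytes in UTF-8.
my $string = "\x{fc}ber walrus";

my ( @char_starts, @char_ends, @starts, @ends );
while ( $string =~ /\S+/g ) {
    # $-[0] and $+[0] are the character offsets of the last match.
    push @char_starts, $-[0];
    push @char_ends,   $+[0];

    # Convert to byte offsets by measuring the UTF-8 encoded prefix.
    push @starts, length( encode_utf8( substr( $string, 0, $-[0] ) ) );
    push @ends,   length( encode_utf8( substr( $string, 0, $+[0] ) ) );
}

print "chars: @char_starts / @char_ends\n";    # chars: 0 5 / 4 11
print "bytes: @starts / @ends\n";              # bytes: 0 6 / 5 12
```

Re-encoding the prefix for every token is quadratic in the worst case, so this is only meant to show the distinction, not to serve as a fast path.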
Here's how add_many_tokens would look if it were implemented in Perl:
    use bytes ();    # makes bytes::substr available without enabling byte semantics

    sub add_many_tokens {
        my ( $batch, $orig_string, $starts, $ends ) = @_;
        for my $i ( 0 .. $#$starts ) {
            # $starts and $ends are array refs, so index with [ ]
            my $start  = $starts->[$i];
            my $end    = $ends->[$i];
            my $len    = $end - $start;
            my $string = bytes::substr( $orig_string, $start, $len );
            my $token
                = KinoSearch::Analysis::Token->new( $string, $start, $end );
            $batch->append($token);
        }
    }
The C version of that is the fastest way I've found for taking input
from a Tokenizer. Grabbing substrings at the C level based on start and
end points is considerably faster than using regex captures from Perl
space. However, that API is less intuitive, and less flexible. So
perhaps we need two advanced APIs.
    # new method
    $batch->gen_tokens( \@strings, \@starts, \@ends );

    # add_many_tokens, renamed
    $batch->gen_tokens_substr( $string, \@starts, \@ends );
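To make the difference between the two argument lists concrete, here is a small sketch that builds the \@strings a caller would pass to the first form from the single $string and offsets used by the second. It relies only on core substr(); gen_tokens and gen_tokens_substr are just the names floated above and are not called here.

```perl
use strict;
use warnings;

my $string = 'i am the walrus';
my @starts = ( 0, 2, 5, 9 );
my @ends   = ( 1, 4, 8, 15 );

# The input is pure ASCII, so byte offsets and character offsets coincide
# and plain substr() is enough to cut out each token's text.
my @strings
    = map { substr( $string, $starts[$_], $ends[$_] - $starts[$_] ) }
    0 .. $#starts;

print join( '|', @strings ), "\n";    # i|am|the|walrus
```

With the first form the caller does this slicing in Perl space and is free to alter each string; with the second form the slicing is left to C.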
Thoughts?
Marvin Humphrey
Rectangular Research
http://www.rectangular.com/