[KinoSearch] indexing speed of 0.20_02

Marvin Humphrey marvin at rectangular.com
Thu Mar 15 15:43:34 PST 2007


On Mar 15, 2007, at 9:21 AM, Roger Dooley wrote:

> I've just started working with the devel release and have modified  
> my indexer for 0.15 to the new model.  The document set is rather  
> large (+.5 million) and indexing this took many hours with the 0.15  
> release. However, with 0.20, I haven't been able to index the files  
> as the indexing seems to be taking days and I end up killing the  
> process and looking at the code again.

At least some of the slowdown is a side effect of UTF-8 compatibility  
in 0.20.  Tokenizer is a major offender, and the bottleneck is Perl's  
UTF-8 character class regex implementation.

I'm a little surprised by the scale, though.  According to my  
benchmarking tests, we'd taken about a 35% hit, going from around 3.1  
seconds under 0.15 to around 4.2 seconds for 0.20.  We actually lost  
a lot more than that with the transition to UTF-8, but I've continued  
to make strides optimizing the engine -- if you take Tokenizer out of  
the loop, and use a purpose-built C tokenizer instead (the  
ASCIIWhiteSpaceTokenizer in devel/benchmarks/BenchMarkingIndexer.pm),  
0.20 is actually 30% *faster* than 0.15, at 1.82 secs vs 2.62 secs.

However, my benchmarker script only uses a Tokenizer.  If your  
analyzer incorporates a Stemmer or a Stopalizer, there may be  
additional drags I hadn't been measuring.  Stemmer seems like a more  
likely culprit, since that's changed to UTF-8 and I don't know how  
UTF-8 Snowball performs in comparison to Latin-1 Snowball.   
Stopalizer is also a possibility, but I'm not sure that hash lookups  
are slower under UTF-8 -- I wouldn't think so.  LCNormalizer is  
almost certainly slower, but I wouldn't guess it would affect things  
too much since it only hits the string once.

Here are some stats originally compiled for a post I made to the Perl
5 Porters list:
<http://www.nntp.perl.org/group/perl.perl5.porters/2007/02/msg121014.html>

     ==================================================================
     Mean time to index 1000 ASCII news articles
     ------------------------------------------------------------------
     tokenizer         5.8.6 (thr)     5.8.8 (no thr)    blead (no thr)
     ------------------------------------------------------------------
     UTF-8 regex       4.18 secs       3.72 secs         3.80 secs
     Latin-1 regex     2.84 secs       2.50 secs         2.60 secs
     Purpose-built C   1.82 secs       1.60 secs         1.64 secs
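
For anyone who wants to see the UTF-8 vs. Latin-1 regex gap in isolation, here is a rough sketch using the core Benchmark module. It is not the KinoSearch benchmarker: the synthetic text, the plain \w+ pattern, and the count_tokens() helper are all assumptions made for illustration, and the numbers will vary by Perl version and build.

```perl
use strict;
use warnings;
use Benchmark qw(timethese cmpthese);

# Assumption: a synthetic ASCII "document" stands in for a news article.
my $bytes = join ' ', ('the quick brown fox jumps over the lazy dog') x 2_000;

# Same text, but stored internally as UTF-8 (the flag is what matters).
my $chars = $bytes;
utf8::upgrade($chars);

# Hypothetical helper: count the tokens a \w+ regex finds in a string.
sub count_tokens {
    my ($text) = @_;
    my $count = 0;
    $count++ while $text =~ /\w+/g;
    return $count;
}

# Run each variant for at least one CPU second, then print a comparison.
my $results = timethese(
    -1,
    {
        'byte string'  => sub { count_tokens($bytes) },
        'UTF-8 string' => sub { count_tokens($chars) },
    },
    'none',
);
cmpthese($results);
```

Both variants tokenize identical ASCII text, so any difference in the comparison table comes from how the regex engine handles the internal string representation.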


It turns out that Perl's current UTF-8 char-class implementation is  
sub-optimal.  Yves Orton (a.k.a. demerphq) and I have had some  
preliminary discussions about how to go about improving it.  Yves has  
actually made the regex engine pluggable in blead; what may happen  
eventually is that after 5.10 comes out I'll hack up a slightly  
tweaked version of the regex engine which (only) Tokenizer will use.

I'd actually love to go in and hack on Perl's regex engine right now,  
and the work to implement char classes in terms of "inversion lists"  
probably isn't insane (bwa ha ha).  However, I haven't done so  
because 1) I'd have to invest some time to come up to speed on the  
gory details of the regex engine, and 2) KinoSearch's indexing  
performance has been good enough up till now that it's been more  
important to work on other features.
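
For readers unfamiliar with the term, an inversion list stores a character class as a sorted array of the code points where membership flips, so a lookup is a binary search. The sketch below is a generic illustration of the data structure in plain Perl, not code from Perl's regex engine; the toy @word_chars list (ASCII only) and the in_inversion_list() helper are hypothetical.

```perl
use strict;
use warnings;

# An inversion list is a sorted array of code points at which set
# membership flips: (start1, end1 + 1, start2, end2 + 1, ...).
# This toy list covers only ASCII 0-9, A-Z, underscore and a-z.
my @word_chars = ( 0x30, 0x3A, 0x41, 0x5B, 0x5F, 0x60, 0x61, 0x7B );

# Binary-search for the number of boundaries <= $cp.  An odd count
# means the code point falls inside one of the ranges.
sub in_inversion_list {
    my ( $list, $cp ) = @_;
    my ( $lo, $hi ) = ( 0, scalar @$list );
    while ( $lo < $hi ) {
        my $mid = int( ( $lo + $hi ) / 2 );
        if ( $list->[$mid] <= $cp ) { $lo = $mid + 1 }
        else                        { $hi = $mid }
    }
    return $lo % 2;
}

for my $char ( 'a', 'Z', '7', '_', ' ', '-' ) {
    printf "'%s' is %sa word character\n", $char,
        in_inversion_list( \@word_chars, ord $char ) ? '' : 'not ';
}
```

The lookup cost grows with the logarithm of the number of ranges rather than with the size of the character set, which is the appeal for large Unicode classes.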

It may be time to make another stab at moving the Tokenizer loop to C.

             while (/$token_re/g) {
                 push @starts, $-[0];
                 push @ends,   $+[0];
             }

My first attempt at that predated Yves' work exposing and documenting
the regex engine API:
<http://search.cpan.org/~rgarcia/perl-5.9.4/pod/perlreguts.pod>.
With the aid of the new docs, I can probably figure things out for
blead, then backport for 5.8.x.

There are significant inefficiencies in how @- and @+ are retrieved
under UTF-8 -- they recalculate UTF-8 length on every access -- and
that's damned expensive if you're doing it for every token.  (This
happens in the function Perl_magic_regdatum_get() in mg.c.)  If I can
run the loop in C, I can get at the original numbers from the regex
engine struct and avoid that.
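
To make the offset issue concrete, here is a self-contained version of the loop above wrapped in a hypothetical token_offsets() helper (the name and the \w+ pattern are assumptions for illustration, not KinoSearch's API). It shows that @- and @+ report character offsets for a UTF-8 string, which is why Perl has to translate from the engine's internal byte positions on each access.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of KinoSearch: collect the start and end
# offsets of every token matched by $token_re in $text.
sub token_offsets {
    my ( $text, $token_re ) = @_;
    my ( @starts, @ends );
    while ( $text =~ /$token_re/g ) {
        push @starts, $-[0];    # offset where the match begins
        push @ends,   $+[0];    # offset just past the end of the match
    }
    return ( \@starts, \@ends );
}

# Two Greek letters (two bytes each in UTF-8), a space, then "gamma".
my $text = "\x{3b1}\x{3b2} gamma";
my ( $starts, $ends ) = token_offsets( $text, qr/\w+/ );
print "$starts->[$_]-$ends->[$_]\n" for 0 .. $#$starts;
```

The offsets printed are 0-2 and 3-8, counted in characters; the same tokens sit at bytes 0-4 and 5-10 in the underlying UTF-8 buffer.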

If you don't want to wait for me to complete this work and you have  
Inline C skillz, you might try carving up your own Tokenizer based on  
ASCIIWhiteSpaceTokenizer.

Otherwise, if you (or anybody else) want to help me out, I could use
some benchmarking numbers with various configs.  Time I spend doing
the benchmarking (which other people can do) is time I don't spend
rooting around in the scariest crags of Perl and KinoSearch C code
(which not many other people are going to be able to do).  Numbers
for different Analyzers would be very helpful.  So would long vs.
short source strings.

Hope this long-winded reply helps you -- composing it helped me.

Cheers,

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/




