[KinoSearch] utf8 warnings/error

Marvin Humphrey marvin at rectangular.com
Sat Aug 25 00:30:42 PDT 2007




On Aug 24, 2007, at 10:54 AM, Scott Beck wrote:

> I still can't reproduce these errors on a small test case :(

I know the feeling, and I appreciate your attempts.

There are a few variables in KS that are handy for dialing down the  
scale and exposing large problems with small datasets.

Here's a snippet from buildlib/TestSchema.pm...

   # Expose problems faced by much larger indexes by using absurdly low
   # values for index_interval and skip_interval.
   sub index_interval {5}
   sub skip_interval  {3}

... and another from buildlib/KinoTestUtils.pm:

   # set mem_thresh to 1 KiB in order to expose problems with flushing
   $KinoSearch::Index::PostingsWriter::instance_vars{mem_thresh} = 0x400;

That last one affects the threshold that triggers the external  
sorter, and is probably the most useful.
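
To illustrate why a tiny threshold helps, here is a toy model. It is a
hypothetical sketch, not KS code: build_runs() and its byte accounting
are invented for this example, and the 16 MiB "large" value is just a
stand-in, not KS's real default. The buffer is flushed as a sorted run
whenever its size exceeds the threshold, so with a small threshold even
a small dataset goes through many flushes.

```perl
use strict;
use warnings;

# Hypothetical toy model, not KinoSearch code: buffer items until their
# combined length exceeds $mem_thresh, then flush the buffer as one
# sorted run.
sub build_runs {
    my ( $mem_thresh, @items ) = @_;
    my ( @runs, @buffer );
    my $bytes = 0;
    for my $item (@items) {
        push @buffer, $item;
        $bytes += length $item;
        if ( $bytes > $mem_thresh ) {
            push @runs, [ sort @buffer ];
            @buffer = ();
            $bytes  = 0;
        }
    }
    # Flush whatever is left over as a final run.
    push @runs, [ sort @buffer ] if @buffer;
    return \@runs;
}

# 1000 items of 8 bytes each.
my @items = map { sprintf "term%04d", $_ } 1 .. 1000;

my $small = build_runs( 0x400,     @items );    # 1 KiB threshold
my $large = build_runs( 0x1000000, @items );    # assumed 16 MiB threshold

printf "1 KiB threshold:  %d run(s)\n", scalar @$small;
printf "16 MiB threshold: %d run(s)\n", scalar @$large;
```

With the large threshold the flush path never runs on a dataset this
size, which is why a bug in it can hide from a small test case.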

> I did however get some feedback from valgrind although I don't know
> how helpful it is. I thought I would post it here as a follow up. I
> will continue to debug this and see if I can figure it out.

This readout is strange.  It implies that the regular expression  
matcher is attempting to match on parts of the string that are not  
allocated.

While the Tokenizer hacks a bit into Perl's internals, it's not doing
anything outlandish -- just the equivalent of m/$pat/g.  It starts at
the top of the string, asks the regular expression engine to find the
first match, and keeps asking over and over until all matches have
been collected.
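
In pure Perl, that loop looks roughly like the sketch below. This is
only an approximation of what the XS code does: tokenize() is a
hypothetical helper, and the pattern is an assumed example rather than
necessarily the one in use on your system.

```perl
use strict;
use warnings;

# Hypothetical pure-Perl stand-in for the Tokenizer's matching loop.
sub tokenize {
    my ( $text, $pat ) = @_;
    my @tokens;

    # In scalar context, each m//g match resumes where the last one
    # ended, so the loop walks the string from start to finish.
    while ( $text =~ /$pat/g ) {
        push @tokens, {
            text  => substr( $text, $-[0], $+[0] - $-[0] ),
            start => $-[0],    # offset where the match begins
            end   => $+[0],    # offset just past the end of the match
        };
    }
    return \@tokens;
}

# Assumed example pattern: runs of word characters, allowing
# internal apostrophes.
my $tokens = tokenize( "it's a test", qr/\w+(?:'\w+)*/ );
printf "%-5s %d-%d\n", @{$_}{qw( text start end )} for @$tokens;
```

Every offset the loop hands back should fall inside the string, which
is why reads past the end of the allocation look so odd.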

I think that by Perl 5.8.4 the nastiest Unicode bugs should all have
been fixed.

> valgrind errors from my tests:
> ==1766== Invalid read of size 1
> ==1766==    at 0x814C06F: Perl_swash_fetch (utf8.c:1747)
> ==1766==    by 0x813BF82: S_find_byclass (regexec.c:1248)
> ==1766==    by 0x813E456: Perl_regexec_flags (regexec.c:1945)
> ==1766==    by 0x8138538: Perl_pregexec (regexec.c:323)
> ==1766==    by 0x61284B9: XS_KinoSearch__Analysis__Tokenizer__do_analyze (KinoSearch.xs:4741)
> ==1766==    by 0x80DE048: Perl_pp_entersub (pp_hot.c:2854)
> ==1766==    by 0x80BCA83: Perl_runops_debug (dump.c:1442)
> ==1766==    by 0x8064024: S_run_body (perl.c:1921)
> ==1766==    by 0x8063AE5: perl_run (perl.c:1840)
> ==1766==    by 0x805F69A: main (perlmain.c:86)
> ==1766==  Address 0x630C48F is 6 bytes after a block of size 17 alloc'd
> ==1766==    at 0x401B507: malloc (vg_replace_malloc.c:149)
> ==1766==    by 0x80BCFEB: Perl_safesysmalloc (util.c:67)
> ==1766==    by 0x80E1817: Perl_sv_grow (sv.c:1637)
> ==1766==    by 0x80E6E24: Perl_sv_setsv_flags (sv.c:4019)
> ==1766==    by 0x80EDD93: Perl_newSVsv (sv.c:7049)
> ==1766==    by 0x814BC09: Perl_swash_fetch (utf8.c:1717)
> ==1766==    by 0x8149F03: Perl_is_utf8_alnum (utf8.c:1191)
> ==1766==    by 0x813BF0C: S_find_byclass (regexec.c:1246)
> ==1766==    by 0x813E456: Perl_regexec_flags (regexec.c:1945)
> ==1766==    by 0x8138538: Perl_pregexec (regexec.c:323)
> ==1766==    by 0x61284B9: XS_KinoSearch__Analysis__Tokenizer__do_analyze (KinoSearch.xs:4741)
> ==1766==    by 0x80DE048: Perl_pp_entersub (pp_hot.c:2854)
> ==1766==    by 0x80BCA83: Perl_runops_debug (dump.c:1442)
> ==1766==    by 0x8064024: S_run_body (perl.c:1921)
> ==1766==    by 0x8063AE5: perl_run (perl.c:1840)
> ==1766==    by 0x805F69A: main (perlmain.c:86)
>
> ==1766== Invalid read of size 1
> ==1766==    at 0x814C06F: Perl_swash_fetch (utf8.c:1747)
> ==1766==    by 0x8145D17: S_regrepeat (regexec.c:4089)
> ==1766==    by 0x814497E: S_regmatch (regexec.c:3732)
> ==1766==    by 0x813EE8D: S_regtry (regexec.c:2185)
> ==1766==    by 0x813BFA8: S_find_byclass (regexec.c:1249)
> ==1766==    by 0x813E456: Perl_regexec_flags (regexec.c:1945)
> ==1766==    by 0x8138538: Perl_pregexec (regexec.c:323)
> ==1766==    by 0x61284B9: XS_KinoSearch__Analysis__Tokenizer__do_analyze (KinoSearch.xs:4741)
> ==1766==    by 0x80DE048: Perl_pp_entersub (pp_hot.c:2854)
> ==1766==    by 0x80BCA83: Perl_runops_debug (dump.c:1442)
> ==1766==    by 0x8064024: S_run_body (perl.c:1921)
> ==1766==    by 0x8063AE5: perl_run (perl.c:1840)
> ==1766==    by 0x805F69A: main (perlmain.c:86)
> ==1766==  Address 0x630C48F is 6 bytes after a block of size 17 alloc'd
> ==1766==    at 0x401B507: malloc (vg_replace_malloc.c:149)
> ==1766==    by 0x80BCFEB: Perl_safesysmalloc (util.c:67)
> ==1766==    by 0x80E1817: Perl_sv_grow (sv.c:1637)
> ==1766==    by 0x80E6E24: Perl_sv_setsv_flags (sv.c:4019)
> ==1766==    by 0x80EDD93: Perl_newSVsv (sv.c:7049)
> ==1766==    by 0x814BC09: Perl_swash_fetch (utf8.c:1717)
> ==1766==    by 0x8149F03: Perl_is_utf8_alnum (utf8.c:1191)
> ==1766==    by 0x813BF0C: S_find_byclass (regexec.c:1246)
> ==1766==    by 0x813E456: Perl_regexec_flags (regexec.c:1945)
> ==1766==    by 0x8138538: Perl_pregexec (regexec.c:323)
> ==1766==    by 0x61284B9: XS_KinoSearch__Analysis__Tokenizer__do_analyze (KinoSearch.xs:4741)
> ==1766==    by 0x80DE048: Perl_pp_entersub (pp_hot.c:2854)
> ==1766==    by 0x80BCA83: Perl_runops_debug (dump.c:1442)
> ==1766==    by 0x8064024: S_run_body (perl.c:1921)
> ==1766==    by 0x8063AE5: perl_run (perl.c:1840)
> ==1766==    by 0x805F69A: main (perlmain.c:86)
>
> I don't know if this is related but after I index and then do a
> delete/insert, my index is really broken.

Yeah.  Unfortunately, that one looks like a real KS bug.  I can  
verify that merging segments with deletions can produce index  
corruption.  :(

I believe that the code that's to blame is new to KS 0.20_04.   
0.20_04 contains a refactoring of KinoSearch's external sorter to  
perform some of its own memory management, using some innovations  
recently uncovered by Lucene developer Michael McCandless as he  
implemented a variant of the KinoSearch merge model in Lucene.  This  
led to improved speed, but at the expense of increased complexity,  
and somewhere hidden in that complexity is a nasty little bug.

> I will continue to try and reduce the problem to as small a case as
> possible. Thanks for all your time and effort.

And thank you for yours. I'm sorry I was not able to be more
responsive during the week, but things are starting to lighten up a bit.

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/


