[KinoSearch] utf8 warnings/error
Marvin Humphrey
marvin at rectangular.com
Sat Aug 25 00:30:42 PDT 2007
On Aug 24, 2007, at 10:54 AM, Scott Beck wrote:
> I still can't reproduce these errors on a small test case :(
I know the feeling, and I appreciate your attempts.
There are a few variables in KS that are handy for dialing down the
scale, so that problems normally seen only in large indexes show up
with small datasets.
Here's a snippet from buildlib/TestSchema.pm...
# Expose problems faced by much larger indexes by using absurdly low
# values for index_interval and skip_interval.
sub index_interval {5}
sub skip_interval {3}
... and another from buildlib/KinoTestUtils.pm:
# set mem_thresh to 1 KiB in order to expose problems with flushing
$KinoSearch::Index::PostingsWriter::instance_vars{mem_thresh} = 0x400;
That last one affects the threshold that triggers the external
sorter, and is probably the most useful.
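To show why a tiny threshold helps, here is a toy sketch of a buffer that flushes a sorted run whenever it crosses a byte threshold. It is not KinoSearch's external sorter: feed(), flush(), @buffer and @runs are hypothetical names made up for illustration, and the runs stay in memory rather than going to disk. Only the 0x400 value comes from the test setting above.

```perl
use strict;
use warnings;

# Toy illustration only; none of these names come from KinoSearch.
my $mem_thresh     = 0x400;    # 1 KiB, the value used in the test setting
my @buffer;                    # items waiting to be sorted
my $buffered_bytes = 0;        # rough size of what is in @buffer
my @runs;                      # each flushed run is a sorted array ref

# Add one item, flushing a sorted run once the threshold is reached.
sub feed {
    my ($item) = @_;
    push @buffer, $item;
    $buffered_bytes += length $item;
    flush() if $buffered_bytes >= $mem_thresh;
}

# Sort whatever is buffered, store it as a run, and reset the buffer.
sub flush {
    return unless @buffer;
    push @runs, [ sort @buffer ];
    @buffer         = ();
    $buffered_bytes = 0;
}

# 1000 nine-byte items force several flushes even though the data is tiny.
feed( sprintf( "term%05d", $_ ) ) for reverse 1 .. 1000;
flush();    # final partial run
printf "%d items produced %d runs\n", 1000, scalar @runs;
```

With a realistic threshold of many megabytes, the same 1000 items would never trigger a mid-stream flush, so any bug in the flush-and-merge path would stay hidden.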
> I did however get some feedback from valgrind although I don't know
> how helpful it is. I thought I would post it here as a follow up. I
> will continue to debug this and see if I can figure it out.
This readout is strange. It implies that the regular expression
matcher is attempting to match on parts of the string that are not
allocated.
While the Tokenizer hacks a bit into Perl's internals, it's not doing
anything outlandish -- just the equivalent of m/$pat/g. It starts at
the top of the string, asks the regular expression engine to find the
first match, then keeps asking until all matches have been collected.
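In pure Perl, that loop looks roughly like the sketch below. This is not the XS code in KinoSearch.xs: tokenize() and the hash keys are hypothetical names, and the \w+ pattern is only an example.

```perl
use strict;
use warnings;

# Hypothetical helper, not KinoSearch's actual code: collect every match
# of $pat in $text, with its start and end offsets in characters.
sub tokenize {
    my ( $text, $pat ) = @_;
    my @tokens;

    # In scalar context, each m//g call resumes where the last match ended
    # and returns false once no further match is found.
    while ( $text =~ /$pat/g ) {

        # $-[0] and $+[0] are the start and end offsets of the last match.
        push @tokens,
            {
            text  => substr( $text, $-[0], $+[0] - $-[0] ),
            start => $-[0],
            end   => $+[0],
            };
    }
    return \@tokens;
}

my $tokens = tokenize( "foo bar-baz", qr/\w+/ );
print join( ", ",
    map { sprintf "%s[%d,%d]", $_->{text}, $_->{start}, $_->{end} }
        @$tokens ),
    "\n";
```

Nothing in a loop like this should ever hand the regex engine a pointer outside the string, which is why the readout is so puzzling.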
I think that by Perl 5.8.4 the nastiest Unicode bugs should have been
fixed.
> valgrind errors from my tests:
> ==1766== Invalid read of size 1
> ==1766== at 0x814C06F: Perl_swash_fetch (utf8.c:1747)
> ==1766== by 0x813BF82: S_find_byclass (regexec.c:1248)
> ==1766== by 0x813E456: Perl_regexec_flags (regexec.c:1945)
> ==1766== by 0x8138538: Perl_pregexec (regexec.c:323)
> ==1766== by 0x61284B9: XS_KinoSearch__Analysis__Tokenizer__do_analyze (KinoSearch.xs:4741)
> ==1766== by 0x80DE048: Perl_pp_entersub (pp_hot.c:2854)
> ==1766== by 0x80BCA83: Perl_runops_debug (dump.c:1442)
> ==1766== by 0x8064024: S_run_body (perl.c:1921)
> ==1766== by 0x8063AE5: perl_run (perl.c:1840)
> ==1766== by 0x805F69A: main (perlmain.c:86)
> ==1766== Address 0x630C48F is 6 bytes after a block of size 17 alloc'd
> ==1766== at 0x401B507: malloc (vg_replace_malloc.c:149)
> ==1766== by 0x80BCFEB: Perl_safesysmalloc (util.c:67)
> ==1766== by 0x80E1817: Perl_sv_grow (sv.c:1637)
> ==1766== by 0x80E6E24: Perl_sv_setsv_flags (sv.c:4019)
> ==1766== by 0x80EDD93: Perl_newSVsv (sv.c:7049)
> ==1766== by 0x814BC09: Perl_swash_fetch (utf8.c:1717)
> ==1766== by 0x8149F03: Perl_is_utf8_alnum (utf8.c:1191)
> ==1766== by 0x813BF0C: S_find_byclass (regexec.c:1246)
> ==1766== by 0x813E456: Perl_regexec_flags (regexec.c:1945)
> ==1766== by 0x8138538: Perl_pregexec (regexec.c:323)
> ==1766== by 0x61284B9: XS_KinoSearch__Analysis__Tokenizer__do_analyze (KinoSearch.xs:4741)
> ==1766== by 0x80DE048: Perl_pp_entersub (pp_hot.c:2854)
> ==1766== by 0x80BCA83: Perl_runops_debug (dump.c:1442)
> ==1766== by 0x8064024: S_run_body (perl.c:1921)
> ==1766== by 0x8063AE5: perl_run (perl.c:1840)
> ==1766== by 0x805F69A: main (perlmain.c:86)
>
> ==1766== Invalid read of size 1
> ==1766== at 0x814C06F: Perl_swash_fetch (utf8.c:1747)
> ==1766== by 0x8145D17: S_regrepeat (regexec.c:4089)
> ==1766== by 0x814497E: S_regmatch (regexec.c:3732)
> ==1766== by 0x813EE8D: S_regtry (regexec.c:2185)
> ==1766== by 0x813BFA8: S_find_byclass (regexec.c:1249)
> ==1766== by 0x813E456: Perl_regexec_flags (regexec.c:1945)
> ==1766== by 0x8138538: Perl_pregexec (regexec.c:323)
> ==1766== by 0x61284B9: XS_KinoSearch__Analysis__Tokenizer__do_analyze (KinoSearch.xs:4741)
> ==1766== by 0x80DE048: Perl_pp_entersub (pp_hot.c:2854)
> ==1766== by 0x80BCA83: Perl_runops_debug (dump.c:1442)
> ==1766== by 0x8064024: S_run_body (perl.c:1921)
> ==1766== by 0x8063AE5: perl_run (perl.c:1840)
> ==1766== by 0x805F69A: main (perlmain.c:86)
> ==1766== Address 0x630C48F is 6 bytes after a block of size 17 alloc'd
> ==1766== at 0x401B507: malloc (vg_replace_malloc.c:149)
> ==1766== by 0x80BCFEB: Perl_safesysmalloc (util.c:67)
> ==1766== by 0x80E1817: Perl_sv_grow (sv.c:1637)
> ==1766== by 0x80E6E24: Perl_sv_setsv_flags (sv.c:4019)
> ==1766== by 0x80EDD93: Perl_newSVsv (sv.c:7049)
> ==1766== by 0x814BC09: Perl_swash_fetch (utf8.c:1717)
> ==1766== by 0x8149F03: Perl_is_utf8_alnum (utf8.c:1191)
> ==1766== by 0x813BF0C: S_find_byclass (regexec.c:1246)
> ==1766== by 0x813E456: Perl_regexec_flags (regexec.c:1945)
> ==1766== by 0x8138538: Perl_pregexec (regexec.c:323)
> ==1766== by 0x61284B9: XS_KinoSearch__Analysis__Tokenizer__do_analyze (KinoSearch.xs:4741)
> ==1766== by 0x80DE048: Perl_pp_entersub (pp_hot.c:2854)
> ==1766== by 0x80BCA83: Perl_runops_debug (dump.c:1442)
> ==1766== by 0x8064024: S_run_body (perl.c:1921)
> ==1766== by 0x8063AE5: perl_run (perl.c:1840)
> ==1766== by 0x805F69A: main (perlmain.c:86)
>
> I don't know if this is related but after I index and then do a
> delete/insert, my index is really broken.
Yeah. Unfortunately, that one looks like a real KS bug. I can
verify that merging segments with deletions can produce index
corruption. :(
I believe that the code that's to blame is new to KS 0.20_04.
0.20_04 contains a refactoring of KinoSearch's external sorter to
perform some of its own memory management, using some innovations
recently uncovered by Lucene developer Michael McCandless as he
implemented a variant of the KinoSearch merge model in Lucene. This
led to improved speed, but at the expense of increased complexity,
and somewhere hidden in that complexity is a nasty little bug.
> I will continue to try and reduce the problem to as small a case as
> possible. Thanks for all your time and effort.
And thank you for yours. I'm sorry I was not able to be more
responsive during the week, but things are starting to lighten up a bit.
Marvin Humphrey
Rectangular Research
http://www.rectangular.com/
_______________________________________________
KinoSearch mailing list
KinoSearch at rectangular.com
http://www.rectangular.com/mailman/listinfo/kinosearch