[KinoSearch] Invalid UTF-8
Peter Karman
peter at peknet.com
Wed Jan 27 20:43:22 PST 2010
Marvin Humphrey wrote on 1/27/10 6:41 PM:
> On Tue, Jan 26, 2010 at 07:15:16PM -0800, Marvin Humphrey wrote:
>
>> Yup, I've now duplicated the problem on my system using 60,000 docs.
>
> Fixed by r5764.
cool. thanks for digging in.
I have tested it under RHEL (works great with ~90k docs, 2 GB of data) and OS X
10.6 (where it fails, see below), both 64-bit.
The OSX behaviour was weird. First time it segfaulted. Ran it again under gdb
and it completed ok. Ran it again without gdb and I got this:
[karpet at pekmac:~/tmp]$ perl ks-test.pl swishdocs2/
Crawled 1000000 documents
Read past EOF of
'/Volumes/users/karpet/tmp/test-ks-utf8/seg_2/ptemp-4284913-to-4383411' (offset:
4284913 len: 98498), S_refill at ../core/KinoSearch/Store/InStream.c line 145
at ks-test.pl line 65
Using same test script as I posted before, with 1m docs instead of 33k.
>
>> I bet I can get that way down by fiddling with the flush threshold.
>
> Ultimately, I was able to isolate the trigger to a single document with two fields, by
> bringing the threshold at which PostingListWriter flushes all of its
> PostingPools way, way down:
>
> -#define DEFAULT_MEM_THRESH 0x1000000
> +/* #define DEFAULT_MEM_THRESH 0x1000000 */
> +#define DEFAULT_MEM_THRESH 0x10
>
> When that variable lived in Perl, the KinoSearch::Test module used to set it
> to a much smaller number at load time. This had the effect of simulating
> large indexes as far as PostingListWriter was concerned, by forcing runs to be
> flushed many many times. However, it turns out that we have been doing
> without that important simulation for a long time -- the entire KS test suite
> was not triggering a PostingPool flush even once. I'm a little surprised that
> after all the refactoring I did on this code recently, there was only a single
> glitch that needed to be fixed.
>
> Now even if I set the threshold to 0x100, the whole test suite passes.
>
this is good and interesting to know. Is there a way, or any plan, to make
DEFAULT_MEM_THRESH alterable at runtime? I'm assuming that in situations where
available RAM is low, it would be helpful to trade speed for memory by
setting the threshold lower and flushing to disk more often. Is that a realistic
assumption?
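
To make the question concrete, here is a rough sketch of what a runtime
override might look like. It is not KinoSearch code: mem_thresh,
set_mem_thresh, init_mem_thresh_from_env, the KS_MEM_THRESH environment
variable and ToyWriter are all hypothetical names made up for illustration.
Only DEFAULT_MEM_THRESH and its value come from the diff quoted above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Compile-time default, same value as in the quoted diff. */
#define DEFAULT_MEM_THRESH 0x1000000

/* Hypothetical runtime copy of the threshold, in bytes. */
static size_t mem_thresh = DEFAULT_MEM_THRESH;

/* Hypothetical setter; a value of 0 is ignored. */
void set_mem_thresh(size_t bytes) {
    if (bytes > 0) {
        mem_thresh = bytes;
    }
}

size_t get_mem_thresh(void) {
    return mem_thresh;
}

/* Hypothetical: read an override from the (made-up) KS_MEM_THRESH
 * environment variable. Base 0 lets strtoul accept "0x10" or "16".
 * Unparseable or zero values leave the current threshold unchanged. */
void init_mem_thresh_from_env(void) {
    const char *val = getenv("KS_MEM_THRESH");
    if (val != NULL && *val != '\0') {
        char *end = NULL;
        unsigned long parsed = strtoul(val, &end, 0);
        if (*end == '\0' && parsed > 0) {
            mem_thresh = (size_t)parsed;
        }
    }
}

/* Toy stand-in for a writer that buffers in RAM and flushes a run to
 * disk whenever the buffered size reaches the threshold. */
typedef struct {
    size_t   mem_used;    /* bytes currently buffered */
    unsigned flush_count; /* number of simulated flushes so far */
} ToyWriter;

void toy_writer_add(ToyWriter *w, size_t bytes) {
    assert(w != NULL);
    w->mem_used += bytes;
    if (w->mem_used >= mem_thresh) {
        w->flush_count++; /* a real writer would write a run here */
        w->mem_used = 0;
    }
}
```

With the threshold set to 0x10, the toy writer flushes after every 16 buffered
bytes, so a lower threshold means less RAM held at once and more frequent
(slower) disk writes, which is the trade-off I am asking about.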
--
Peter Karman . http://peknet.com/ . peter at peknet.com