[KinoSearch] I'm getting fewer than expected results when supplying multiple fields

Adam . adamfletcher.work at googlemail.com
Sat Nov 10 03:23:31 PST 2007



On (09/11/07 20:51), Marvin Humphrey wrote:
> Hello Adam,
> 
> Thanks for the detailed report.
> 
> > I'm using the devel version (0.20_05).
> 
> Was this index originally built under 0.20_04, and does it have  
> deletions?  That's one known bug, leading to index corruption.

No, I created the index using 0.20_05 and it doesn't contain any
deletions.

> Also, how many segments are in the index?  (You can tell at a glance  
> by counting files with a ".cf" extension within the index directory.)

There is one segment file in the index, which is about 69MB in size.

> > Performing a search on another field, fieldx:foo, gives me 4481 hits.
> > I have confirmed that this quantity is correct for this field.
> >
> > When I do the following : search.pl q="all:1 AND fieldx:foo", I get a
> > lower quantity of 4449 hits. I've lost 32 documents.
> 
> This behavior suggests a bug in either ANDScorer or one of the  
> PostingList subclasses.
>
> [snip]
> 
> One possibility is that PostingList is reading records incorrectly,  
> so that the iterated doc nums don't match what ought to be in that  
> array.

I did consider that, particularly when I read about you changing doc
nums to start at 1 (rather than zero), but that change doesn't affect
0.20_05. I added some debug code to output *my* unique identifier
for each document returned, but that didn't reveal anything more to me.
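
A minimal sketch of how those identifier dumps could be compared, written
in Python for brevity and not using any KinoSearch API. The function name
and the sample IDs are hypothetical; the real lists would be whatever the
search script prints for each of the three queries.

```python
def missing_from_and(ids_a, ids_b, ids_and):
    """Return IDs that match both single-field queries but are absent
    from the combined AND query's results."""
    expected = set(ids_a) & set(ids_b)
    return sorted(expected - set(ids_and))


# Hypothetical sample data standing in for the three result lists.
all_hits = ["d1", "d2", "d3", "d4"]      # e.g. hits for all:1
fieldx_hits = ["d2", "d3", "d4"]         # e.g. hits for fieldx:foo
and_hits = ["d2", "d4"]                  # e.g. hits for the AND query

print(missing_from_and(all_hits, fieldx_hits, and_hits))
```

Run against the real output, the returned list would name the lost
documents directly, which may show whether they cluster by position in
the index.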

> [snip]
>
> The second possibility is that PostingList is fine, but ANDScorer is  
> performing the intersection improperly.
> 
> > Why would the 3 searches not yield the same results?
> 
> They should.
> 
> There are two stages of compilation for that particular query string:  
> QueryParser produces a BooleanQuery, and BooleanQuery produces a  
> BooleanScorer wrapping an ANDScorer.  ANDScorer operates on an array  
> of subscorers (in this case there would be two TermScorers in the  
> array), and the order in which the subscorers are arranged matters in  
> terms of how the intersection algorithm plays out.
> 
> My intuition is that, if it's not the deletions issue, then  
> ANDScorer_skip_to is to blame.  The algo, which is very similar to  
> that used by PhraseScorer, is only mildly convoluted, but it happens  
> to be hard to write tests for.
> 
> If you can supply a failing test case, I will work with that  
> directly.  Otherwise, I'll attempt to improve testing for ANDScorer  
> and hope that the bug shows itself.

I'll try to do that. I have already tried to rebuild the index so that
it only contains the 2 fields mentioned, and 4481 records, but the
results from that index are correct.

I'll strip out the irrelevant code/data and send my data and test case
to you off-list once I've got a refined example.
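
As a rough illustration of the kind of test that could exercise a
skip_to-based intersection, here is a toy sketch in Python. It is not
KinoSearch's ANDScorer: the ListPostings class, and_intersect, and check
are hypothetical stand-ins, assuming only that each subscorer exposes
next, skip_to, and the current doc num. The intersection is compared
against a plain set intersection over randomly generated doc num lists.

```python
import random


class ListPostings:
    """Toy posting list over sorted doc nums (hypothetical, not KinoSearch code)."""

    def __init__(self, doc_nums):
        self.docs = sorted(doc_nums)
        self.pos = -1

    def doc(self):
        return self.docs[self.pos]

    def next(self):
        self.pos += 1
        return self.pos < len(self.docs)

    def skip_to(self, target):
        # Advance to the first doc num >= target; False when exhausted.
        if self.pos < 0:
            self.pos = 0
        while self.pos < len(self.docs) and self.docs[self.pos] < target:
            self.pos += 1
        return self.pos < len(self.docs)


def and_intersect(a, b):
    """Intersect two posting lists by letting each skip to the other's doc."""
    hits = []
    if not a.next() or not b.next():
        return hits
    while True:
        if a.doc() == b.doc():
            hits.append(a.doc())
            if not a.next() or not b.skip_to(a.doc()):
                return hits
        elif a.doc() < b.doc():
            if not a.skip_to(b.doc()):
                return hits
        else:
            if not b.skip_to(a.doc()):
                return hits


def check(trials=200, seed=0):
    """Compare the skip_to intersection with a set intersection, in both orders."""
    rng = random.Random(seed)
    for _ in range(trials):
        xs = set(rng.sample(range(1, 500), rng.randint(0, 100)))
        ys = set(rng.sample(range(1, 500), rng.randint(0, 100)))
        want = sorted(xs & ys)
        assert and_intersect(ListPostings(xs), ListPostings(ys)) == want
        assert and_intersect(ListPostings(ys), ListPostings(xs)) == want
    return True
```

Running both subscorer orders matters here because, as noted above, the
arrangement of the subscorers affects how the intersection plays out.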

Thanks,

Adam

_______________________________________________
KinoSearch mailing list
KinoSearch at rectangular.com
http://www.rectangular.com/mailman/listinfo/kinosearch



