[KinoSearch] I'm getting fewer than expected results when supplying multiple fields

Marvin Humphrey marvin at rectangular.com
Fri Nov 9 20:51:25 PST 2007



Hello Adam,

Thanks for the detailed report.

> I'm using the devel version (0.20_05).

Was this index originally built under 0.20_04, and does it have  
deletions?  That's one known bug, leading to index corruption.

Also, how many segments are in the index?  (You can tell at a glance  
by counting files with a ".cf" extension within the index directory.)
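
If you'd rather script the count, here is a minimal sketch in Python. The function name count_segments is made up for illustration, and the only assumption is the one stated above: one ".cf" file per segment, sitting directly inside the index directory.

```python
from pathlib import Path


def count_segments(index_dir):
    """Count the files ending in .cf directly inside index_dir."""
    return sum(
        1
        for entry in Path(index_dir).iterdir()
        if entry.is_file() and entry.suffix == ".cf"
    )
```

Call it with the path to the index directory, e.g. count_segments("/path/to/index").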

> Performing a search on another field, fieldx:foo, gives me 4481 hits.
> I have confirmed that this quantity is correct for this field.
>
> When I do the following : search.pl q="all:1 AND fieldx:foo", I get a
> lower quantity of 4449 hits. I've lost 32 documents.

This behavior suggests a bug in either ANDScorer or one of the  
PostingList subclasses.

Think of a PostingList as an array of document numbers associated  
with a particular term.  You have two PostingLists (inside  
TermScorers), and it's ANDScorer's job to take the intersection.   
Conceptually, it's simple enough.  However, PostingList cannot be  
implemented as an array because that wouldn't scale -- under the  
hood, it's an iterator reading compressed records off of disk.
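
To make the iterator idea concrete, here is a toy sketch in Python. It is not KinoSearch's code or file format: the class ToyPostingList, the helper encode, and the choice of storing gaps between successive doc numbers are all assumptions made for illustration. It does show why a single misread record can throw off every doc number that follows.

```python
class ToyPostingList:
    """Iterate doc numbers stored as gaps between successive docs.

    The gap encoding is an illustrative assumption, not KinoSearch's
    actual on-disk format.
    """

    def __init__(self, deltas):
        self._deltas = iter(deltas)
        self.doc_num = 0  # 0 means "not positioned yet" in this toy

    def next(self):
        """Advance to the next doc; return False once exhausted."""
        try:
            self.doc_num += next(self._deltas)
        except StopIteration:
            return False
        return True

    def skip_to(self, target):
        """Advance until doc_num >= target; return False if exhausted."""
        while self.doc_num < target:
            if not self.next():
                return False
        return True


def encode(doc_nums):
    """Turn ascending doc numbers into the gaps ToyPostingList reads."""
    deltas, prev = [], 0
    for doc_num in doc_nums:
        deltas.append(doc_num - prev)
        prev = doc_num
    return deltas
```

Because each doc number is rebuilt from the ones before it, one bad gap shifts everything after it: the docs 3, 4, 10 are stored as gaps 3, 1, 6, and if the middle gap is misread as 2 the iterator reports 3, 5, 11 instead.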

One possibility is that PostingList is reading records incorrectly,  
so that the iterated doc nums don't match what ought to be in that  
array.  That was what happened with the old deletions bug: the stream  
got out of sync because the data was garbage, and if KS didn't  
segfault outright, the results were incorrect.

The second possibility is that PostingList is fine, but ANDScorer is  
performing the intersection improperly.

> Why would the 3 searches not yield the same results?

They should.

There are two stages of compilation for that particular query string:  
QueryParser produces a BooleanQuery, and BooleanQuery produces a  
BooleanScorer wrapping an ANDScorer.  ANDScorer operates on an array  
of subscorers (in this case there would be two TermScorers in the  
array), and the order in which the subscorers are arranged matters in  
terms of how the intersection algorithm plays out.
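
For illustration, here is a rough Python sketch of a skip_to-driven intersection over an array of subscorers. It is not the actual ANDScorer code: ListCursor is a hypothetical stand-in for a TermScorer backed by a sorted in-memory list, and and_intersect is a made-up name. The order of the subscorers changes which one drives the loop and how many skips happen, but it should never change the set of hits.

```python
from bisect import bisect_left


class ListCursor:
    """Hypothetical stand-in for a TermScorer over sorted, unique doc nums."""

    def __init__(self, docs):
        self.docs = docs
        self.pos = -1  # not positioned yet

    def doc(self):
        return self.docs[self.pos]

    def next(self):
        self.pos += 1
        return self.pos < len(self.docs)

    def skip_to(self, target):
        # Move forward (never backward) to the first doc >= target.
        self.pos = bisect_left(self.docs, target, max(self.pos, 0))
        return self.pos < len(self.docs)


def and_intersect(cursors):
    """Return the doc numbers present in every cursor."""
    hits = []
    if not cursors:
        return hits
    first = cursors[0]
    if not first.next():
        return hits
    target = first.doc()
    while True:
        agreed = True
        for cursor in cursors:
            if not cursor.skip_to(target):
                return hits  # one list ran dry, so no more matches
            if cursor.doc() > target:
                target = cursor.doc()  # overshoot: raise the bar
                agreed = False
        if agreed:
            hits.append(target)
            if not first.next():
                return hits
            target = first.doc()
```

With the lists [1, 3, 5, 7, 9] and [3, 4, 7, 10], both orderings of the two cursors yield [3, 7].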

My intuition is that if it's not the deletions issue, then
ANDScorer_skip_to is to blame.  The algo, which is very similar to  
that used by PhraseScorer, is only mildly convoluted, but it happens  
to be hard to write tests for.

If you can supply a failing test case, I will work with that  
directly.  Otherwise, I'll attempt to improve testing for ANDScorer  
and hope that the bug shows itself.
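
One way to hunt for such a case is a randomized comparison against a brute-force set intersection, keeping the doc numbers small so that any failure is easy to read. The sketch below is in Python and only shows the shape of such a test; it is not the KinoSearch test suite. All names are hypothetical, and merge_intersection is just a simple candidate to exercise the harness, where a real test would plug in the scorer under suspicion.

```python
import random


def reference_intersection(lists):
    """Brute-force answer: set intersection, returned in ascending order."""
    common = set(lists[0])
    for docs in lists[1:]:
        common &= set(docs)
    return sorted(common)


def merge_intersection(lists):
    """Simple candidate under test: repeated two-pointer merge."""
    result = lists[0]
    for docs in lists[1:]:
        merged, i, j = [], 0, 0
        while i < len(result) and j < len(docs):
            if result[i] == docs[j]:
                merged.append(result[i])
                i += 1
                j += 1
            elif result[i] < docs[j]:
                i += 1
            else:
                j += 1
        result = merged
    return result


def find_failing_case(candidate, trials=200, max_doc=50, seed=0):
    """Return the first input on which candidate disagrees, else None."""
    rng = random.Random(seed)
    for _ in range(trials):
        lists = [
            sorted(rng.sample(range(1, max_doc + 1), rng.randint(0, max_doc)))
            for _ in range(rng.randint(2, 4))
        ]
        # Try both orderings, since subscorer order affects the algorithm.
        for ordering in (lists, lists[::-1]):
            if candidate(ordering) != reference_intersection(ordering):
                return ordering
    return None
```

A None result means no disagreement turned up in the sampled inputs; anything else is a small set of doc-number lists that reproduces the mismatch.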

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/






