[KinoSearch] I'm getting fewer than expected results when supplying multiple fields
Marvin Humphrey
marvin at rectangular.com
Fri Nov 9 20:51:25 PST 2007
Hello Adam,
Thanks for the detailed report.
> I'm using the devel version (0.20_05).
Was this index originally built under 0.20_04, and does it have
deletions? That's one known bug, leading to index corruption.
Also, how many segments are in the index? (You can tell at a glance
by counting files with a ".cf" extension within the index directory.)
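A small sketch of that check, for anyone who prefers a script to eyeballing the directory. `count_segments` is a hypothetical helper, not part of KinoSearch, and it assumes, as described above, that each segment contributes exactly one ".cf" file at the top level of the index directory.

```python
from pathlib import Path


def count_segments(index_dir):
    """Count regular files ending in ".cf" directly inside index_dir."""
    return sum(
        1
        for entry in Path(index_dir).iterdir()
        if entry.is_file() and entry.name.endswith(".cf")
    )
```

Calling `count_segments("/path/to/index")` with your own index path returns the segment count under that assumption.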
> Performing a search on another field, fieldx:foo, gives me 4481 hits.
> I have confirmed that this quantity is correct for this field.
>
> When I do the following : search.pl q="all:1 AND fieldx:foo", I get a
> lower quantity of 4449 hits. I've lost 32 documents.
This behavior suggests a bug in either ANDScorer or one of the
PostingList subclasses.
Think of a PostingList as an array of document numbers associated
with a particular term. You have two PostingLists (inside
TermScorers), and it's ANDScorer's job to take the intersection.
Conceptually, it's simple enough. However, PostingList cannot be
implemented as an array because that wouldn't scale -- under the
hood, it's an iterator reading compressed records off of disk.
One possibility is that PostingList is reading records incorrectly,
so that the iterated doc nums don't match what ought to be in that
array. That was what happened with the old deletions bug: the stream
got out of sync because the data was garbage, and if KS didn't
segfault outright, the results were incorrect.
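To illustrate why one misread record throws off everything after it, here is a minimal sketch of an iterator over compressed doc numbers. It assumes a delta encoding with variable-length integers, which is a common posting-list layout but is only an assumption here; the actual KinoSearch file format is not described in this message. `encode_postings`, `PostingStream`, and `decode_all` are hypothetical names.

```python
import io


def encode_postings(doc_nums):
    """Delta-encode sorted doc numbers as base-128 variable-length ints."""
    out = bytearray()
    prev = 0
    for doc in doc_nums:
        delta = doc - prev
        prev = doc
        while delta >= 0x80:
            out.append((delta & 0x7F) | 0x80)  # low 7 bits, continuation bit set
            delta >>= 7
        out.append(delta)  # final byte, continuation bit clear
    return bytes(out)


class PostingStream:
    """Hypothetical stand-in for a PostingList: decodes one record at a time."""

    def __init__(self, data):
        self._stream = io.BytesIO(data)
        self._doc = 0

    def next(self):
        """Return the next doc number, or None when the stream is exhausted."""
        delta, shift = 0, 0
        while True:
            chunk = self._stream.read(1)
            if not chunk:
                return None
            byte = chunk[0]
            delta |= (byte & 0x7F) << shift
            if not byte & 0x80:
                break
            shift += 7
        self._doc += delta  # each record is relative to the previous doc
        return self._doc


def decode_all(data):
    """Drain a PostingStream into a list of doc numbers."""
    stream = PostingStream(data)
    docs = []
    while (doc := stream.next()) is not None:
        docs.append(doc)
    return docs
```

Because each record is stored relative to the previous one, damaging a single byte changes every doc number that follows it, not just the one record. That is the sense in which the stream gets "out of sync".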
The second possibility is that PostingList is fine, but ANDScorer is
performing the intersection improperly.
> Why would the 3 searches not yield the same results?
They should.
There are two stages of compilation for that particular query string:
QueryParser produces a BooleanQuery, and BooleanQuery produces a
BooleanScorer wrapping an ANDScorer. ANDScorer operates on an array
of subscorers (in this case there would be two TermScorers in the
array), and the order in which the subscorers are arranged matters in
terms of how the intersection algorithm plays out.
My intuition is that if it's not the deletions issue, then
ANDScorer_skip_to is to blame. The algo, which is very similar to
that used by PhraseScorer, is only mildly convoluted, but it happens
to be hard to write tests for.
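For readers unfamiliar with the technique, here is a rough sketch of a skip_to-driven intersection. This is not the KinoSearch source: `ListScorer` and `and_intersect` are hypothetical names, the subscorers are plain in-memory lists rather than on-disk streams, and the doc numbers in each list are assumed to be sorted and unique.

```python
class ListScorer:
    """Stand-in for a TermScorer: iterates a sorted list of doc numbers."""

    def __init__(self, doc_nums):
        self._docs = doc_nums
        self._pos = -1  # positioned before the first doc

    def doc(self):
        """Current doc number, or None once the list is exhausted."""
        return self._docs[self._pos] if self._pos < len(self._docs) else None

    def next(self):
        """Advance by one doc."""
        self._pos += 1
        return self.doc()

    def skip_to(self, target):
        """Advance to the first doc number >= target (may stay in place)."""
        if self._pos < 0:
            self._pos = 0
        while self._pos < len(self._docs) and self._docs[self._pos] < target:
            self._pos += 1
        return self.doc()


def and_intersect(scorers):
    """Yield doc numbers present in every scorer (needs at least one scorer)."""
    candidate = scorers[0].next()
    while candidate is not None:
        agreed = True
        for scorer in scorers:
            doc = scorer.skip_to(candidate)
            if doc is None:
                return  # one list ran out, so no further matches are possible
            if doc > candidate:
                candidate = doc  # overshoot: restart with the larger doc
                agreed = False
                break
        if agreed:
            yield candidate
            candidate = scorers[0].next()
```

With this sketch, `list(and_intersect([ListScorer([1, 3, 5, 7, 9]), ListScorer([3, 4, 7, 10])]))` gives `[3, 7]`, and swapping the two scorers gives the same list. The sequence of skip_to calls differs between the two orderings, though, which is where an order-dependent bug would have room to hide.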
If you can supply a failing test case, I will work with that
directly. Otherwise, I'll attempt to improve testing for ANDScorer
and hope that the bug shows itself.
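One way such testing could be approached is a randomized comparison against a brute-force set intersection, trying both argument orders since subscorer order matters. The sketch below is only an illustration of that idea: `find_failing_case` is a hypothetical harness, and `buggy_intersect` is a deliberately broken stand-in used to show the harness catching something, not a model of any actual KinoSearch bug.

```python
import random


def find_failing_case(intersect, trials=1000, max_doc=50, seed=0):
    """Return (left, right, expected) for the first mismatch, else None."""
    rng = random.Random(seed)  # fixed seed keeps failures reproducible
    for _ in range(trials):
        a = sorted(rng.sample(range(max_doc), rng.randint(0, max_doc // 2)))
        b = sorted(rng.sample(range(max_doc), rng.randint(0, max_doc // 2)))
        expected = sorted(set(a) & set(b))  # brute-force reference answer
        for left, right in ((a, b), (b, a)):
            if list(intersect(left, right)) != expected:
                return left, right, expected
    return None


def buggy_intersect(a, b):
    """Deliberately wrong: advances both lists even when the docs differ."""
    i = j = 0
    hits = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            hits.append(a[i])
        i += 1
        j += 1
    return hits
```

Running `find_failing_case(buggy_intersect)` returns a small pair of doc-number lists that reproduces the mismatch, which is the kind of failing test case that is easiest to work from.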
Marvin Humphrey
Rectangular Research
http://www.rectangular.com/