[KinoSearch] Omitted results for different num_wanted values
Marvin Humphrey
marvin at rectangular.com
Thu Jul 2 09:48:11 PDT 2009
On Thu, Jul 02, 2009 at 05:01:41PM +0200, Nick Wellnhofer wrote:
> The index is built in one go without other additions or deletions.
OK, that eliminates a couple of likely candidates. The hit collection
mechanism has been heavily modified over the last couple months, in an effort
to max out speed. The transition between segments was my first guess, but if
there's only one segment (because the index was written in one shot), that
can't be it.
This really ought to be a hit collection problem. With only one segment and
no deletions, the hit collection loop simplifies down to this:
    $hit_collector->set_matcher($matcher);
    while (my $doc_num = $matcher->next) {
        $hit_collector->collect($matcher);
    }
The behavior of $matcher should not be affected by num_wanted. The set of
matched documents should be deterministic given a query and an index. It's up
to the collector to determine what to do with those matches.
So, that leaves SortCollector and HitQueue as the likely villains.
> I'm not using SortSpec, but I tried a SortSpec sorting by score and doc_id
> with the same result.
Yes. Supplying a sort-by-score-then-doc_id SortSpec ought to be exactly the
same as not supplying one, so that's as expected.
> Here are the doc_ids and scores for a query with num_wanted => 20
>
> 12126: 9.36577701568604
> 10623: 9.26733779907227
> 9592: 6.62579917907715
> 10686: 3.88833498954773
> 7776: 3.88633751869202
> 8081: 3.60492777824402
> 10923: 0.136016055941582 ***
> 11107: 0.136016055941582
> 9881: 0.118667937815189
> 10136: 0.108812846243382
> 9158: 0.078095369040966
> 10616: 0.0764531493186951
> 11217: 0.0635384768247604
> 10563: 0.0588966794312
> 12129: 0.048088937997818
> 8701: 0.0340040139853954
> 12257: 0.0119014047086239
Two observations. First, you asked for a maximum of 20 results, but there are
17. That means the HitQueue is not yet full, and the SortCollector has not
yet transitioned over to "queue full" mode.
Second, the missing result is in the middle of the list. That means it's
not likely to be an off-by-one error in something downstream like Hits.
I'm beginning to suspect that there's an error occurring at the moment we hit
num_wanted matches.
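To make the two collector modes concrete, here is a minimal sketch of a bounded collector in plain Perl. It is not KinoSearch's actual SortCollector or HitQueue: make_collector, run, and the toy hits are hypothetical names and data invented for illustration, and the ranking (score descending, then doc_id ascending) is an assumption. The property it demonstrates is the one that appears to be violated here: the results for a small num_wanted should be a prefix of the results for a larger one.

```perl
use strict;
use warnings;

# Hypothetical bounded collector: keeps at most $num_wanted hits, ranked by
# score descending, then doc_id ascending. Illustration only; this is not
# KinoSearch's real SortCollector.
sub make_collector {
    my ($num_wanted) = @_;
    my @queue;    # kept sorted, best hit first
    # Comparator: negative when $x ranks ahead of $y.
    my $cmp = sub {
        my ($x, $y) = @_;
        return $y->{score} <=> $x->{score} || $x->{doc_id} <=> $y->{doc_id};
    };
    my $collect = sub {
        my ($doc_id, $score) = @_;
        my $hit = { doc_id => $doc_id, score => $score };
        if (@queue < $num_wanted) {
            # Mode 1: queue not yet full, so every hit is kept.
            push @queue, $hit;
        }
        elsif ($cmp->($hit, $queue[-1]) < 0) {
            # Mode 2: queue full, so replace the worst hit only if beaten.
            $queue[-1] = $hit;
        }
        else {
            return;
        }
        @queue = sort { $cmp->($a, $b) } @queue;
        return;
    };
    my $results = sub { map { $_->{doc_id} } @queue };
    return ($collect, $results);
}

# Toy [doc_id, score] pairs, fed in ascending doc_id order.
my @hits = ([1, 0.5], [2, 3.0], [3, 0.1], [4, 2.0], [5, 0.5]);

sub run {
    my ($num_wanted) = @_;
    my ($collect, $results) = make_collector($num_wanted);
    $collect->(@$_) for @hits;
    return join ' ', $results->();
}

print run(10), "\n";    # queue never fills: all five hits, ranked
print run(3),  "\n";    # queue fills at the third hit: top three only
```

With these toy hits the first line printed is "2 4 1 5 3" and the second is "2 4 1", so the smaller result set is a prefix of the larger one. A bug at the switch from the first mode to the second would break exactly that property.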
> num_wanted => 20
>
> 13533: 6.35173892974854
> 11709: 5.42276954650879
> 13288: 4.91336631774902
> 15935: 4.44877099990845
> 13292: 4.22429084777832
> 15941: 1.7977180480957
> 15918: 0.254095643758774 ***
> 15177: 0.222333684563637
> 15185: 0.222333684563637
> 13203: 0.1905717253685
> 13276: 0.158809781074524
> 15893: 0.158809781074524
> 13102: 0.127047821879387
> 13543: 0.127047821879387
>
> num_wanted => 10
>
> 13533: 6.35173892974854
> 11709: 5.42276954650879
> 13288: 4.91336631774902
> 15935: 4.44877099990845
> 13292: 4.22429084777832
> 15941: 1.7977180480957
> 15177: 0.222333684563637
> 15185: 0.222333684563637
> 13203: 0.1905717253685
> 13276: 0.158809781074524
As before, we haven't hit num_wanted matches yet -- there are only 14. And if
we sort the results by doc_id, we find that the missing doc is the 12th one
collected.
1. 11709: 5.42276954650879
2. 13102: 0.127047821879387
3. 13203: 0.1905717253685
4. 13276: 0.158809781074524
5. 13288: 4.91336631774902
6. 13292: 4.22429084777832
7. 13533: 6.35173892974854
8. 13543: 0.127047821879387
9. 15177: 0.222333684563637
10. 15185: 0.222333684563637
11. 15893: 0.158809781074524
12. 15918: 0.254095643758774 ***
13. 15935: 4.44877099990845
14. 15941: 1.7977180480957
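The comparison above can be reproduced with a short script. This is only a sketch: the doc_ids are copied from the two listings, the variable names are made up for illustration, and it assumes that docs are collected in ascending doc_id order.

```perl
use strict;
use warnings;

# Doc ids in ranked order, copied from the two result listings.
my @wanted_20 = qw(13533 11709 13288 15935 13292 15941 15918
                   15177 15185 13203 13276 15893 13102 13543);
my @wanted_10 = qw(13533 11709 13288 15935 13292 15941
                   15177 15185 13203 13276);

# The first 10 hits of the larger run are what the smaller run should return.
my @expected = @wanted_20[ 0 .. $#wanted_10 ];
my %got      = map { $_ => 1 } @wanted_10;
my @missing  = grep { !$got{$_} } @expected;

# Assumption: collection order is ascending doc_id order.
my @by_doc_id = sort { $a <=> $b } @wanted_20;
my %position;
@position{@by_doc_id} = 1 .. @by_doc_id;

print "doc $_ is missing; collected at position $position{$_}\n" for @missing;
```

For the data above this prints "doc 15918 is missing; collected at position 12".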
Let me look into this and report back.
Marvin Humphrey