[KinoSearch] getting total hits before a seek

Marvin Humphrey marvin at rectangular.com
Thu Mar 8 10:30:07 PST 2007


On Mar 8, 2007, at 9:16 AM, Brett Paden wrote:

> Just out of curiosity, why is a seek required to populate total hits?

The API for 0.15 is a little sneaky.  Calling Searcher->search  
doesn't actually run the matching/scoring.  Calling Hits->seek does.

During matching/scoring, doc_num/score pairs are not accumulated in  
an array or a hash, as most Perl programmers might suppose (including  
this one, who did something like that in the original  
Search::Kinosearch distribution).  They are put into a priority  
queue, which is much more efficient in terms of memory -- but also  
discards any matches that fall off the end once its capacity is  
exceeded.
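
To illustrate the trade-off described above, here is a minimal sketch of a bounded priority queue in Python. It is not KinoSearch's code: the function name collect_top and the sample data are made up for illustration. It keeps only the best "capacity" matches while still counting every match it sees.

```python
import heapq

def collect_top(scored_docs, capacity):
    """Keep only the `capacity` best matches, but count all of them.

    `scored_docs` is an iterable of (doc_num, score) pairs.
    Hypothetical helper, not part of any KinoSearch release.
    """
    heap = []        # min-heap of (score, doc_num); weakest retained hit is heap[0]
    total_hits = 0
    for doc_num, score in scored_docs:
        total_hits += 1
        if capacity <= 0:
            continue                      # nothing can be retained
        if len(heap) < capacity:
            heapq.heappush(heap, (score, doc_num))
        elif (score, doc_num) > heap[0]:
            # The new hit beats the weakest retained one, which is discarded.
            heapq.heapreplace(heap, (score, doc_num))
    ranked = sorted(heap, reverse=True)   # best score first
    return [(doc_num, score) for score, doc_num in ranked], total_hits

matches = [(1, 0.2), (2, 0.9), (3, 0.5), (4, 0.7), (5, 0.1)]
top, total = collect_top(matches, capacity=2)
print(top, total)
```

Memory stays proportional to the queue capacity rather than to the number of matching documents, which is why the capacity has to be known before matching starts.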

Because Searcher->search in 0.15 doesn't know how many documents  
you're going to need, it can't know how big the priority queue needs  
to be.  So KS waits until Hits->seek, when that number can be derived  
by adding "offset" and "num_wanted".

KS has to complete the matching/scoring process before it can know  
the value that should be returned by Hits->total_hits.  In KS version  
0.05, total_hits() actually threw an error if you hadn't called  
seek() first.  In his perl.com review, though, chromatic panned this  
behavior as non-intuitive, so I added the internal seek().
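
The deferred behavior can be sketched roughly as follows, in Python rather than Perl. Everything here is an assumption made for illustration: the class name LazyHits, the runs counter, and the default arguments of the internal seek are invented, and a plain sort stands in for the priority queue to keep the sketch short.

```python
class LazyHits:
    """Hypothetical stand-in for a Hits object that defers the search."""

    def __init__(self, scored_docs):
        self._scored_docs = list(scored_docs)  # (doc_num, score) pairs
        self._total_hits = None                # unknown until a search runs
        self.page = []
        self.runs = 0                          # how many times matching ran

    def seek(self, offset, num_wanted):
        # Matching/scoring happens here, not when the object is created.
        self.runs += 1
        capacity = offset + num_wanted
        ranked = sorted(self._scored_docs, key=lambda pair: pair[1], reverse=True)
        self.page = ranked[:capacity][offset:]
        self._total_hits = len(ranked)

    def total_hits(self):
        if self._total_hits is None:
            # Internal seek; the (0, 10) default is an assumption.
            self.seek(0, 10)
        return self._total_hits

hits = LazyHits([(1, 0.2), (2, 0.9), (3, 0.5)])
runs_before = hits.runs
total = hits.total_hits()
print(runs_before, total, hits.runs)
```

Asking for the total before any seek silently triggers one search run instead of raising an error.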

In 0.20, things have changed.  "offset" and "num_wanted" have been  
added to the Searcher->search API so that it can actually run the  
search, which is what I think most people would expect.

Also now, Hits->seek only reruns the search if the size of the  
priority queue would exceed that of previous runs.  So if you call  
seek(0, 100) then seek(0, 10), the search doesn't get rerun -- but if  
you call seek(0, 10) then seek(0, 20) or seek(10, 10), it does.
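
The rerun rule can be written out as a small Python sketch. The class SeekTracker is hypothetical and only tracks queue sizes; it assumes the required capacity is offset + num_wanted, as described earlier for 0.15, and it does not perform any search.

```python
class SeekTracker:
    """Hypothetical model of when a seek would force the search to rerun."""

    def __init__(self, offset, num_wanted):
        # Capacity used by the initial search run.
        self.capacity = offset + num_wanted

    def seek(self, offset, num_wanted):
        """Return True if this seek would need a rerun with a bigger queue."""
        needed = offset + num_wanted
        if needed > self.capacity:
            self.capacity = needed   # remember the larger queue size
            return True
        return False                 # existing results already cover the request

print(SeekTracker(0, 100).seek(0, 10))   # smaller window, no rerun
print(SeekTracker(0, 10).seek(0, 20))    # needs 20 slots, rerun
print(SeekTracker(0, 10).seek(10, 10))   # also needs 20 slots, rerun
```

Note that seek(10, 10) needs a queue of 20 even though it returns only 10 hits, because the first 10 still have to be ranked in order to be skipped.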

The absence of "offset" and "num_wanted" from the Searcher->search  
API and the activation of actual matching/scoring by the Hits object  
in 0.15 and earlier are traits inherited from Lucene.  People don't  
much like the behavior of Lucene's Hits class either, I've come to know.

A number of the changes in 0.20 are the product of insights gleaned  
after completing a working Lucene port.  When I was originally  
porting some of these classes, I didn't fully grok why Lucene did  
things a certain way, even though I'd written an entire search engine  
library myself earlier.  Now I have a better understanding, and it's  
possible to discard some of the cargo-cult programming.

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/




