[KinoSearch] getting total hits before a seek
Marvin Humphrey
marvin at rectangular.com
Thu Mar 8 10:30:07 PST 2007
On Mar 8, 2007, at 9:16 AM, Brett Paden wrote:
> Just out of curiosity, why is a seek required to populate total hits?
The API for 0.15 is a little sneaky. Calling Searcher->search
doesn't actually run the matching/scoring. Calling Hits->seek does.
During matching/scoring, doc_num/score pairs are not accumulated in
an array or a hash, as most Perl programmers might suppose (including
this one, who did something like that in the original
Search::Kinosearch distribution). They are put into a priority
queue, which is much more efficient in terms of memory -- but also
discards any matches that fall off the end once its capacity is
exceeded.
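
To illustrate the bounded priority queue described above, here is a minimal Python sketch. It is not KinoSearch code: the function name collect_top_hits, its parameters, and the tie-breaking rule are all hypothetical, chosen only to show how a fixed-capacity queue keeps the best matches while every match is still counted.

```python
import heapq

def collect_top_hits(scored_docs, capacity):
    """Keep only the `capacity` best (doc_num, score) pairs, but count every match.

    Hypothetical helper for illustration; not part of any KinoSearch API.
    """
    queue = []       # min-heap of (score, doc_num); the weakest kept hit sits at queue[0]
    total_hits = 0
    for doc_num, score in scored_docs:
        total_hits += 1  # every match is counted, even if it is later discarded
        if len(queue) < capacity:
            heapq.heappush(queue, (score, doc_num))
        elif queue and (score, doc_num) > queue[0]:
            # The new hit beats the weakest kept hit, which falls off the end.
            heapq.heapreplace(queue, (score, doc_num))
    # Best first. Ties on score are broken by doc_num, an arbitrary choice here.
    top = sorted(queue, reverse=True)
    return [(doc_num, score) for score, doc_num in top], total_hits
```

Memory use is bounded by the capacity rather than by the number of matches, which is the trade-off described above: anything that falls off the end is gone, so the capacity has to be known before matching starts.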
Because Searcher->search in 0.15 doesn't know how many documents
you're going to need, it can't know how big the priority queue needs
to be. So KS waits until Hits->seek, when that number can be derived
by adding "offset" and "num_wanted".
KS has to complete the matching/scoring process before it can know
the value that should be returned by Hits->total_hits. In KS version
0.05, total_hits() actually threw an error if you hadn't called
seek() first. In his perl.com review, though, chromatic panned this
behavior as non-intuitive, so I added the internal seek().
In 0.20, things have changed. "offset" and "num_wanted" have been
added to the Searcher->search API so that it can actually run the
search, which is what I think most people would expect.
Also now, Hits->seek only reruns the search if the size of the
priority queue would exceed that of previous runs. So if you call
seek(0, 100) then seek(0, 10), the search doesn't get rerun -- but if
you call seek(0, 10) then seek(0, 20) or seek(10, 10), it does.
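
The rerun rule can be sketched in Python as follows. This is an illustration under stated assumptions, not the 0.20 implementation: the class name LazyHits, the run_search callback, and the runs counter are hypothetical. The only behavior taken from the description above is that the required queue size is "offset" plus "num_wanted", and that the search is rerun only when that size exceeds what previous runs collected.

```python
class LazyHits:
    """Hypothetical stand-in for a Hits object that reruns the search only when needed."""

    def __init__(self, run_search):
        # run_search(capacity) is assumed to return the best hits, best first,
        # with at most `capacity` entries.
        self._run_search = run_search
        self._capacity = None   # queue size used by the largest run so far
        self._top = []
        self.runs = 0           # how many times the search actually ran

    def seek(self, offset, num_wanted):
        needed = offset + num_wanted
        # Rerun only if this request needs a bigger queue than any previous run.
        if self._capacity is None or needed > self._capacity:
            self._top = self._run_search(needed)
            self._capacity = needed
            self.runs += 1
        return self._top[offset:offset + num_wanted]
```

With this sketch, seek(0, 100) followed by seek(0, 10) runs the search once, while seek(0, 10) followed by seek(0, 20) or seek(10, 10) runs it twice, since both of the latter need a queue of 20.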
The absence of "offset" and "num_wanted" from the Searcher->search
API and the activation of actual matching/scoring by the Hits object
in 0.15 and earlier are traits inherited from Lucene. People don't
much like the behavior of Lucene's Hits class either, I've come to know.
A number of the changes in 0.20 are the product of insights gleaned
after completing a working Lucene port. When I was originally
porting some of these classes, I didn't fully grok why Lucene did
things a certain way, even though I'd written an entire search engine
library myself earlier. Now I have a better understanding, and it's
possible to discard some of the cargo-cult programming.
Marvin Humphrey
Rectangular Research
http://www.rectangular.com/