[KinoSearch] Scorer->next
webmasters at ctosonline.org
Fri May 15 21:38:44 PDT 2009
On May 15, 2009, at 12:12 PM, Marvin Humphrey wrote:
> On Thu, May 14, 2009 at 08:42:32AM -0700, webmasters at ctosonline.org wrote:
>> I’ve just discovered the hard way that Scorer->next has to return doc
>> ids in order. Here is a patch to mention it in the cookbook:
>
> Thanks, applied as r4594.
>
>> Is it this way to allow KS to tiptoe over deleted documents
>> efficiently?
>
> It's all about scalability.
>
> When scoring, all doc id sets are represented as iterators. To get the
> intersection of iterators, they need to proceed in an orderly, predictable
> fashion.
>
> Say you want the top hits for "foo AND bar". If you weren't worried about
> scalability, you could cache separate result sets as arrays -- one for 'foo'
> and one for 'bar' -- then intersect them, sort, and grab the top few results.
> However, that doesn't scale up to millions of hits because of the size of
> those result sets.
>
> So instead, the iterators for 'foo' and 'bar' ascend through doc nums in sync,
> hits are collected one at a time into a priority queue which only holds as
> many hits as absolutely necessary, and less relevant hits fall out the bottom
> of the queue, keeping memory costs under control.
>
> A better implementation of PrefixQuery would actually use a priority queue to
> hold the PostingLists rather than grab all the results at construction time.
> But then the cookbook entry would have to be a lot longer and more elaborate.
Since you’ve taken the time to write this, how about including it in a
comment in one of the source files, or even in documentation?
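
For illustration, the in-sync iteration described above could be sketched
roughly as follows. This is Python, not KinoSearch code: the names intersect
and top_hits, the (doc_id, score) tuples, and the idea of adding the two
scores together are all assumptions made up for the example. It only shows
why ascending doc ids let two streams be intersected without buffering
either one, and how a bounded priority queue keeps memory use small.

```python
import heapq
from typing import Iterable, Iterator, List, Tuple

Hit = Tuple[int, float]  # (doc_id, score); a hypothetical representation


def intersect(a: Iterable[Hit], b: Iterable[Hit]) -> Iterator[Hit]:
    """Yield hits whose doc id appears in both streams.

    Both inputs MUST yield doc ids in ascending order. That ordering is
    what allows each stream to be consumed once, with no buffering.
    """
    it_a, it_b = iter(a), iter(b)
    hit_a, hit_b = next(it_a, None), next(it_b, None)
    while hit_a is not None and hit_b is not None:
        if hit_a[0] < hit_b[0]:
            hit_a = next(it_a, None)      # 'a' is behind: advance it
        elif hit_b[0] < hit_a[0]:
            hit_b = next(it_b, None)      # 'b' is behind: advance it
        else:
            # Same doc in both streams. Summing the scores is an
            # assumption for this sketch, not KinoSearch's formula.
            yield hit_a[0], hit_a[1] + hit_b[1]
            hit_a, hit_b = next(it_a, None), next(it_b, None)


def top_hits(hits: Iterable[Hit], wanted: int) -> List[Hit]:
    """Return the `wanted` best hits, holding at most `wanted` in memory."""
    if wanted <= 0:
        return []
    queue: List[Tuple[float, int]] = []   # min-heap keyed on score
    for doc_id, score in hits:
        if len(queue) < wanted:
            heapq.heappush(queue, (score, doc_id))
        elif score > queue[0][0]:
            # The weakest hit falls out the bottom of the queue.
            heapq.heapreplace(queue, (score, doc_id))
    best_first = sorted(queue, reverse=True)
    return [(doc_id, score) for score, doc_id in best_first]


# Made-up postings for 'foo' and 'bar', in ascending doc id order.
foo = [(1, 0.5), (3, 1.0), (7, 0.25), (9, 2.0)]
bar = [(3, 0.5), (4, 1.0), (7, 0.25), (9, 0.5)]
best = top_hits(intersect(iter(foo), iter(bar)), wanted=2)
```

If either stream yielded doc ids out of order, the comparison in intersect
would skip past matches, which is the failure the cookbook note warns about.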
Father Chrysostomos