[KinoSearch] State of multisearcher/sorting in svn
Marvin Humphrey
marvin at rectangular.com
Tue Jun 10 10:15:48 PDT 2008
On Jun 10, 2008, at 1:23 AM, Henry wrote:
> I've been diligently reading (in some cases glassily, so I may have missed
> something important:) the subversion commits and noticed:
> Log:
> Port the rest of SortSpec to C.
There weren't any meaningful functional changes in that commit. It
was just another step in the process of porting the modules, so that
KS can run from C and be bound to other languages.
> can you provide a description of the current status of
> multisearch/sorting (as of latest svn commit)? I vaguely recall that the
> two (multisearch/sort) were on your todo list at some point.
There's a working implementation, but it's disabled by default and
requires an undocumented call to enable it.
    KinoSearch::Search::MultiSearcher->set_enable_sorting(1);
It's that way because I basically want only people who are subscribed
to this list to be able to use that feature.
Sorting at the single-machine level works pretty well. The "sort
cache", which is maintained for each sortable field, is actually an
array of 32-bit integers, one per document, each indicating that
document's rank in a list sorted on that field. When a sorted search
is requested, these rank numbers are compared rather than the
original field values. It's very fast, and the memory footprint of
the cache, while substantial, is smaller than it would otherwise be
because we only need 32-bit integers rather than the original strings.
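To make the rank-array idea concrete, here is a minimal sketch in C. It is not the KinoSearch implementation: the names DocValue, build_sort_cache and compare_docs are hypothetical, the field values are assumed to be plain NUL-terminated strings compared with strcmp, and giving equal values the same rank is an assumption of this sketch.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper type: pairs a document number with its field value. */
typedef struct {
    uint32_t    doc_id;
    const char *value;
} DocValue;

static int cmp_by_value(const void *a, const void *b) {
    const DocValue *x = a;
    const DocValue *y = b;
    return strcmp(x->value, y->value);
}

/* Fill ranks[doc_id] with the position of that document's value in sorted
 * order. Documents with equal values share a rank (an assumption of this
 * sketch). Returns 0 on success, -1 if memory allocation fails. */
int build_sort_cache(const char **values, uint32_t num_docs, uint32_t *ranks) {
    if (num_docs == 0) return 0;
    DocValue *tmp = malloc(num_docs * sizeof *tmp);
    if (tmp == NULL) return -1;
    for (uint32_t i = 0; i < num_docs; i++) {
        tmp[i].doc_id = i;
        tmp[i].value  = values[i];
    }
    qsort(tmp, num_docs, sizeof *tmp, cmp_by_value);
    uint32_t rank = 0;
    for (uint32_t i = 0; i < num_docs; i++) {
        /* Advance the rank only when the value actually changes. */
        if (i > 0 && strcmp(tmp[i].value, tmp[i - 1].value) != 0) rank++;
        ranks[tmp[i].doc_id] = rank;
    }
    free(tmp);
    return 0;
}

/* Compare two hits at search time using only the cached 32-bit ranks;
 * the original strings are never touched. */
int compare_docs(const uint32_t *ranks, uint32_t doc_a, uint32_t doc_b) {
    return (ranks[doc_a] > ranks[doc_b]) - (ranks[doc_a] < ranks[doc_b]);
}
```

The string comparisons all happen once, when the cache is built. At search time each comparison is a single integer compare, which is where the speed comes from.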
Unfortunately, that model breaks down at the multi-machine level
because the rank numbers are no longer comparable: each node ranks
only its own documents. That means that once we have the top hits for
each node, we have to retrieve the original string values, send them
across the network, and sort at the master node.
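The merge step at the master node might look roughly like the following sketch. It is only an illustration under stated assumptions: RemoteHit and merge_top_hits are hypothetical names, the hits are assumed to have already arrived from the nodes with their original string values attached, and the tie-break on node and document number is added here just to make the order deterministic. Consider a node holding "apple" and "mango" and another holding "banana" and "zebra": both assign local ranks 0 and 1, so only the strings can order the combined list.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical record for a hit returned by a remote node. It carries the
 * original field value because node-local ranks mean nothing elsewhere. */
typedef struct {
    int         node_id;
    uint32_t    doc_id;
    const char *value;
} RemoteHit;

static int cmp_remote_hits(const void *a, const void *b) {
    const RemoteHit *x = a;
    const RemoteHit *y = b;
    int c = strcmp(x->value, y->value);
    if (c != 0) return c;
    /* Tie-break on node, then document number, for a deterministic order. */
    if (x->node_id != y->node_id) {
        return (x->node_id > y->node_id) - (x->node_id < y->node_id);
    }
    return (x->doc_id > y->doc_id) - (x->doc_id < y->doc_id);
}

/* Sort the combined hits from all nodes in place by their string values
 * and return how many of them to keep (at most `wanted`). */
size_t merge_top_hits(RemoteHit *hits, size_t num_hits, size_t wanted) {
    if (num_hits > 1) {
        qsort(hits, num_hits, sizeof *hits, cmp_remote_hits);
    }
    return num_hits < wanted ? num_hits : wanted;
}
```

A full sort of the combined list keeps the sketch short. Since each node's hits are presumably already in order, a k-way merge would do less work, at the cost of more code.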
The infrastructure required to pull that trick off is quite
elaborate. It took a long time to write, and I'm concerned that,
given its sheer size, there are bugs lurking. In particular, I
don't like the implementation of MultiLexicon. I wish there were a
better way.
Marvin Humphrey
Rectangular Research
http://www.rectangular.com/