[KinoSearch] Playing with MultiSearcher framework
Marvin Humphrey
marvin at rectangular.com
Wed Nov 7 12:36:10 PST 2007
On Nov 5, 2007, at 12:15 AM, Henry wrote:
>> It looks like KinoSearch::Searcher does not contain a "use
>> KinoSearch::Search::SortSpec" directive, so you'll have to add that
>> to the scripts running on the slave nodes. I should probably add
>> that to Searcher. Hey, what's one more module to load? :\
>
> OK - added the 'use' line to all nodes. That's resolved that one.
Groovy. I've added a 'use SortSpec' directive to Searchable as of
r2608.
>> 'highlighter' is a required argument for $hits->create_excerpts, so
>> that first line would fail. I should probably add validation code to
>> create_excerpts() so that a more meaningful error message gets
>> produced.
>
> Right you are; sorry for missing that.
OK, glad that problem's solved. I've strengthened the param checking
with r2609.
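
For illustration, the kind of check involved might look like the sketch below. The names check_args and create_excerpts_sketch are hypothetical and this is not the code committed in r2609; it only shows a required named argument being rejected with a clear message when it is missing.

```perl
use strict;
use warnings;
use Carp qw( croak );

# Hypothetical helper, not KinoSearch internals: reject unknown named
# arguments, then confirm every required one is present and defined.
sub check_args {
    my ( $required, $optional, %args ) = @_;
    my %known = map { $_ => 1 } @$required, @$optional;
    for my $name ( sort keys %args ) {
        croak("Invalid parameter: '$name'") unless $known{$name};
    }
    for my $name (@$required) {
        croak("Missing required parameter: '$name'")
            unless defined $args{$name};
    }
    return 1;
}

# Stand-in for a method such as create_excerpts(); only the argument
# validation is shown, the real work is omitted.
sub create_excerpts_sketch {
    my ( $self, %args ) = @_;
    check_args( ['highlighter'], [], %args );
    return "excerpts requested with " . ref( $args{highlighter} );
}
```

With a check like this, calling the method without 'highlighter' dies immediately with "Missing required parameter: 'highlighter'" instead of failing later in unrelated code.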
> OK, using the following to create excerpts results in the error below:
>
> my $highlighter = KinoSearch::Highlight::Highlighter->new;
> $highlighter->add_spec( field => 'body' );
> $highlighter->add_spec( field => 'title' );
> $hits->create_excerpts( highlighter => $highlighter );
>
> Can't call method "term_vector" on unblessed reference at
> /usr/lib/perl5/site_perl/5.8.8/i386-linux-thread-multi/KinoSearch/Highlight/Highlighter.pm
> line 226, <GEN2> line 1.
Hmm, curious. It's not immediately apparent why that's happening.
However, I have a kill-many-birds-with-one-stone solution up my
sleeve. We're currently fetching the document correctly. So let's
add the term vector data to the document itself. Put it in an
auxiliary, binary field: e.g. content_HIGHLIGHTDATA.
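
To make the idea concrete, here is a rough sketch of packing highlight data into a binary string and storing it alongside the field it describes. The byte layout, the function names, and the plain-hash document are all assumptions made for illustration; none of this is the actual KinoSearch file format or API. Terms are assumed to be byte strings.

```perl
use strict;
use warnings;

# Serialize term-vector-like data for one field into a binary string.
# Layout (assumed for this sketch only): term count, then for each term a
# length-prefixed term, an occurrence count, and one
# (position, start_offset, end_offset) triple per occurrence.
sub encode_highlight_data {
    my ($tv) = @_;    # { term => [ [ pos, start, end ], ... ] }
    my $blob = pack( 'N', scalar keys %$tv );
    for my $term ( sort keys %$tv ) {
        my @occurrences = @{ $tv->{$term} };
        $blob .= pack( 'N/a* N', $term, scalar @occurrences );
        $blob .= pack( 'N3', @$_ ) for @occurrences;
    }
    return $blob;
}

# Reverse of encode_highlight_data(): walk the blob with a byte offset.
sub decode_highlight_data {
    my ($blob) = @_;
    my %tv;
    my $num_terms = unpack( 'N', $blob );
    my $offset    = 4;
    for ( 1 .. $num_terms ) {
        my ( $term, $count ) = unpack( "\@$offset N/a* N", $blob );
        $offset += 4 + length($term) + 4;
        my @nums = unpack( "\@$offset N" . ( 3 * $count ), $blob );
        $offset += 12 * $count;
        my @triples;
        push @triples, [ splice( @nums, 0, 3 ) ] while @nums;
        $tv{$term} = \@triples;
    }
    return \%tv;
}

# The document carries its own highlight data in an auxiliary field.
my %doc = ( content => 'the quick brown fox' );
$doc{content_HIGHLIGHTDATA} = encode_highlight_data(
    { quick => [ [ 1, 4, 9 ] ], fox => [ [ 3, 16, 19 ] ] } );
```

Since the blob travels with the document, whatever fetches the document has everything it needs to highlight it, with no separate term vector lookup.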

The primary downside is that such a change is not backwards
compatible, but we just made one backwards-incompatible change (doc
nums starting at 1). So it's time to jam in a bunch, while writing
the file format spec.

A side effect is that highlighting won't be enabled by default any
more. That's a little less convenient, but it also means indexes
will default to being smaller.

The *major* upside is that term vectors won't need to be part of the
InvIndex file format spec. :) That section was going to be a PITA,
and by ditching it, we keep things simple and finish the spec sooner.

> Performance (0.5-0.7s) is not bad at all Marvin (admittedly on a small
> subset of the full index), excellent work!
Is there a performance difference between plain search and sorted
search? And are the invindexes optimized?
The primary theoretical flaw in the current sorted remote search
implementation is that there may be a lot of disk thrash for an
unoptimized index as term numbers are converted into terms.
> If this distributed search implementation is less than ideal, then I
> would imagine there are great things to come.
Here's what I have in mind:
SegWriter becomes a public module, and takes on an API similar to
that of PolyAnalyzer -- i.e. it becomes an array of writers. This
will allow us to subclass DocWriter with e.g. DistributedDocWriter.
(PrimaryKeyOnlyDocWriter would be another useful possibility, if
you're combining KS with an RDBMS). That would allow us to have
dedicated machines performing the role of fetching/highlighting.
Lexicons would be handled in similar fashion, as would posting
lists. The idea is to modularize things by task and write
specialized modules for a distributed setup. This is how e.g. Google
does things, and I believe it's a better model than the current
MultiSearcher.
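
A very rough sketch of the "array of writers" shape, to show where a subclass would plug in. Every package here is a hypothetical stand-in under a Sketch:: namespace, not an existing or planned KinoSearch class, and the tokenizing is deliberately naive.

```perl
use strict;
use warnings;

package Sketch::SegWriter;

# Holds an ordered list of component writers, PolyAnalyzer-style.
sub new {
    my ( $class, %args ) = @_;
    return bless { writers => $args{writers} || [] }, $class;
}

# Hand each document to every component writer in turn.
sub add_doc {
    my ( $self, $doc ) = @_;
    $_->add_doc($doc) for @{ $self->{writers} };
    return;
}

package Sketch::DocWriter;

sub new { return bless { stored => [] }, shift }

# Default behavior: keep a copy of the whole document.
sub add_doc {
    my ( $self, $doc ) = @_;
    push @{ $self->{stored} }, {%$doc};
    return;
}

package Sketch::PrimaryKeyOnlyDocWriter;
our @ISA = ('Sketch::DocWriter');

# Keep only the primary key; the full record is assumed to live in an RDBMS.
sub add_doc {
    my ( $self, $doc ) = @_;
    push @{ $self->{stored} }, { id => $doc->{id} };
    return;
}

package Sketch::PostingsWriter;

sub new { return bless { postings => {} }, shift }

# Record which documents contain each term (split on non-word characters).
sub add_doc {
    my ( $self, $doc ) = @_;
    my %seen;
    for my $term ( grep { length } split /\W+/, lc $doc->{content} ) {
        next if $seen{$term}++;
        push @{ $self->{postings}{$term} }, $doc->{id};
    }
    return;
}

package main;

my $doc_writer      = Sketch::PrimaryKeyOnlyDocWriter->new;
my $postings_writer = Sketch::PostingsWriter->new;
my $seg_writer      = Sketch::SegWriter->new(
    writers => [ $doc_writer, $postings_writer ] );

$seg_writer->add_doc( { id => 1, content => 'Distributed search' } );
$seg_writer->add_doc( { id => 2, content => 'Search with an RDBMS' } );
```

Swapping the doc writer for a different subclass changes where and how documents are stored without touching the postings side, which is the point of splitting things up by task.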
Marvin Humphrey
Rectangular Research
http://www.rectangular.com/
_______________________________________________
KinoSearch mailing list
KinoSearch at rectangular.com
http://www.rectangular.com/mailman/listinfo/kinosearch