[KinoSearch] Playing with MultiSearcher framework

Marvin Humphrey marvin at rectangular.com
Wed Nov 7 12:36:10 PST 2007




On Nov 5, 2007, at 12:15 AM, Henry wrote:

>> It looks like KinoSearch::Searcher does not contain a "use
>> KinoSearch::Search::SortSpec" directive, so you'll have to add that
>> to the scripts running on the slave nodes.  I should probably add
>> that to Searcher.  Hey, what's one more module to load? :\
>
> OK - added the 'use' line to all nodes.  That's resolved that one.

Groovy.  I've added a 'use SortSpec' directive to Searchable as of  
r2608.

>> 'highlighter' is a required argument for $hits->create_excerpts, so
>> that first line would fail.  I should probably add validation code to
>> create_excerpts() so that a more meaningful error message gets  
>> produced.
>
> Right you are; sorry for missing that.

OK, glad that problem's solved.  I've strengthened the param checking  
with r2609.

> OK, using the following to create excerpts results in the error below:
>
> my $highlighter = KinoSearch::Highlight::Highlighter->new;
> $highlighter->add_spec( field => 'body' );
> $highlighter->add_spec( field => 'title' );
> $hits->create_excerpts( highlighter => $highlighter );
>
> Can't call method "term_vector" on unblessed reference at
> /usr/lib/perl5/site_perl/5.8.8/i386-linux-thread-multi/KinoSearch/Highlight/Highlighter.pm
> line 226, <GEN2> line 1.

Hmm, curious.  It's not immediately apparent why that's happening.

However, I have a kill-many-birds-with-one-stone solution up my  
sleeve.  We're currently fetching the document correctly.  So let's  
add the term vector data to the document itself.  Put it in an  
auxiliary, binary field: e.g. content_HIGHLIGHTDATA.
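
To make the idea concrete, here is a minimal sketch in plain Perl.  It
is not KinoSearch code: the helper names (encode_highlight_data,
decode_highlight_data) and the packed layout are assumptions made up
for illustration.  Only the field name content_HIGHLIGHTDATA comes from
the proposal above.  The sketch serializes per-term character offsets
into one binary string and stores it next to the field it describes.

```perl
use strict;
use warnings;

# Hypothetical layout, NOT the real KinoSearch format.  Per term:
#   4-byte term length, term bytes, 4-byte integer count, 4-byte integers.
# Terms are assumed to be plain byte strings.
sub encode_highlight_data {
    my ($term_offsets) = @_;
    my $blob = '';
    for my $term ( sort keys %$term_offsets ) {
        # Flatten [ [start, end], ... ] into (start, end, start, end, ...).
        my @flat = map { @$_ } @{ $term_offsets->{$term} };
        $blob .= pack( 'N', length $term ) . $term;
        $blob .= pack( 'N', scalar @flat ) . pack( 'N*', @flat );
    }
    return $blob;
}

# Inverse of encode_highlight_data(): walk the blob record by record.
sub decode_highlight_data {
    my ($blob) = @_;
    my %term_offsets;
    my $pos = 0;
    while ( $pos < length $blob ) {
        my $term_len = unpack( 'N', substr( $blob, $pos, 4 ) );
        $pos += 4;
        my $term = substr( $blob, $pos, $term_len );
        $pos += $term_len;
        my $count = unpack( 'N', substr( $blob, $pos, 4 ) );
        $pos += 4;
        my @flat = unpack( 'N*', substr( $blob, $pos, 4 * $count ) );
        $pos += 4 * $count;
        while ( my ( $start, $end ) = splice( @flat, 0, 2 ) ) {
            push @{ $term_offsets{$term} }, [ $start, $end ];
        }
    }
    return \%term_offsets;
}

# A document is modeled as a plain hash; the auxiliary field rides along
# with the stored field, so fetching the doc also fetches what the
# highlighter needs.
my %doc = ( content => 'the quick brown fox' );
$doc{content_HIGHLIGHTDATA} = encode_highlight_data(
    { quick => [ [ 4, 9 ] ], fox => [ [ 16, 19 ] ] } );

my $decoded = decode_highlight_data( $doc{content_HIGHLIGHTDATA} );
```

The point of the design is that whatever machine fetches the document
already has everything needed to build the excerpt, with no separate
term vector lookup.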

The primary downside is that such a change is not backwards
compatible, but we just made one backwards-incompatible change (doc
nums starting at 1).  So now is the time to jam in a bunch more of
them, while I'm writing the file format spec.

A side effect is that highlighting won't be enabled by default any  
more.  That's a little less convenient, but it also means indexes  
will default to being smaller.

The *major* upside is that term vectors won't need to be part of the  
InvIndex file format spec.  :)  That section was going to be a PITA,  
and by ditching it, we keep things simple and finish the spec sooner.

> Performance (0.5-0.7s) is not bad at all Marvin (admittedly on a small
> subset of the full index), excellent work!

Is there a performance difference between plain search and sorted  
search?  And are the invindexes optimized?

The primary theoretical flaw in the current sorted remote search
implementation is that there may be a lot of disk thrash for an
unoptimized index as term numbers are converted into terms.

> If this distributed search implementation is less than ideal, then I
> would imagine there are great things to come.

Here's what I have in mind:

SegWriter becomes a public module, and takes on an API similar to
that of PolyAnalyzer -- i.e. it becomes an array of writers.  This
will allow us to subclass DocWriter with e.g. DistributedDocWriter.
(PrimaryKeyOnlyDocWriter would be another useful possibility, if
you're combining KS with an RDBMS.)  A DistributedDocWriter would let
us have dedicated machines performing the role of fetching/highlighting.
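
A rough sketch of the "array of writers" shape, in plain Perl.  None of
this API exists yet: the Sketch:: package names, the 'writers' and
'key_field' parameters, and the add_doc() method are all hypothetical,
chosen only to mirror the way PolyAnalyzer wraps a list of analyzers.

```perl
use strict;
use warnings;

# Hypothetical base class: stores a full copy of every document.
package Sketch::DocWriter;

sub new {
    my ( $class, %args ) = @_;
    my $self = { %args, stored => [] };
    return bless $self, $class;
}

sub add_doc {
    my ( $self, $doc ) = @_;
    my %copy = %$doc;
    push @{ $self->{stored} }, \%copy;
}

# Hypothetical subclass: keeps only the primary key, on the assumption
# that the full record lives in an RDBMS.
package Sketch::PrimaryKeyOnlyDocWriter;
our @ISA = ('Sketch::DocWriter');

sub add_doc {
    my ( $self, $doc ) = @_;
    my $key = $self->{key_field};
    push @{ $self->{stored} }, { $key => $doc->{$key} };
}

# Hypothetical SegWriter: nothing more than an array of writers, each of
# which sees every document.
package Sketch::SegWriter;

sub new {
    my ( $class, %args ) = @_;
    return bless { writers => $args{writers} }, $class;
}

sub add_doc {
    my ( $self, $doc ) = @_;
    $_->add_doc($doc) for @{ $self->{writers} };
}

package main;

my $doc_writer = Sketch::PrimaryKeyOnlyDocWriter->new( key_field => 'id' );
my $seg_writer = Sketch::SegWriter->new( writers => [$doc_writer] );
$seg_writer->add_doc( { id => 42, content => 'lives in the RDBMS' } );
```

Swapping in a different DocWriter subclass changes where and how
documents get stored without touching SegWriter itself, which is the
kind of task-by-task modularity I'm after.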

Lexicons would be handled in similar fashion, as would posting  
lists.  The idea is to modularize things by task and write  
specialized modules for a distributed setup.  This is how e.g. Google  
does things, and I believe it's a better model than the current  
MultiSearcher.

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/






