[KinoSearch] Playing with MultiSearcher framework
Henry
henka at cityweb.co.za
Wed Nov 7 23:23:44 PST 2007
>> Can't call method "term_vector" on unblessed reference at
>> /usr/lib/perl5/site_perl/5.8.8/i386-linux-thread-multi/KinoSearch/Highlight/Highlighter.pm
>> line 226, <GEN2> line 1.
>
> Hmm, curious. It's not immediately apparent why that's happening.
>
> However, I have a kill-many-birds-with-one-stone solution up my
> sleeve. We're currently fetching the document correctly. So let's
> add the term vector data to the document itself. Put it in an
> auxiliary, binary field: e.g. content_HIGHLIGHTDATA.
>
> The primary downside is that such a change is not backwards
> compatible, but we just made one backwards-incompatible change (doc
> nums starting at 1). So it's time to jam in a bunch, while writing
> the file format spec.
Sounds good. I've paused global indexing for the time being anyway - busy
with data consolidation/rank analysis, etc.
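
For anyone following along, here is a rough sketch of the auxiliary-field
idea as I understand it. It is Python rather than Perl, and nothing in it is
KinoSearch API: the byte layout, the function names and the dict standing in
for a document are all assumptions made up for illustration. Only the field
name content_HIGHLIGHTDATA comes from the suggestion above.

```python
import struct

# Hypothetical layout (an assumption, not the KinoSearch file format):
#   uint32 term count, then for each term:
#     uint16 byte length of the UTF-8 term text, the text itself,
#     uint32 occurrence count, then one (start, end) uint32 pair per occurrence.


def pack_highlight_data(term_offsets):
    """Serialize {term: [(start, end), ...]} into a single binary blob."""
    parts = [struct.pack("<I", len(term_offsets))]
    for term in sorted(term_offsets):
        raw = term.encode("utf-8")
        spans = term_offsets[term]
        parts.append(struct.pack("<H", len(raw)))
        parts.append(raw)
        parts.append(struct.pack("<I", len(spans)))
        for start, end in spans:
            parts.append(struct.pack("<II", start, end))
    return b"".join(parts)


def unpack_highlight_data(blob):
    """Inverse of pack_highlight_data."""
    pos = 0
    (n_terms,) = struct.unpack_from("<I", blob, pos)
    pos += 4
    result = {}
    for _ in range(n_terms):
        (length,) = struct.unpack_from("<H", blob, pos)
        pos += 2
        term = blob[pos:pos + length].decode("utf-8")
        pos += length
        (n_spans,) = struct.unpack_from("<I", blob, pos)
        pos += 4
        spans = []
        for _ in range(n_spans):
            start, end = struct.unpack_from("<II", blob, pos)
            pos += 8
            spans.append((start, end))
        result[term] = spans
    return result


def term_offsets_for(text):
    """Naive whitespace tokenizer that records character offsets per term."""
    offsets = {}
    pos = 0
    for token in text.split():
        start = text.index(token, pos)
        end = start + len(token)
        offsets.setdefault(token.lower(), []).append((start, end))
        pos = end
    return offsets


def build_doc(content):
    """Return a document carrying its own highlight data in a binary field."""
    return {
        "content": content,
        "content_HIGHLIGHTDATA": pack_highlight_data(term_offsets_for(content)),
    }
```

The point of the sketch is only that whoever fetches the document gets the
highlight data with it, so a highlighter would need nothing beyond the fetched
document.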
> A side effect is that highlighting won't be enabled by default any
> more. That's a little less convenient, but it also means indexes
> will default to being smaller.
Smaller == faster (from an IO perspective anyway). So this is good news
indeed. I've already chomped the size of my indexes dramatically by
limiting document size (to a reasonable 100k).
> The *major* upside is that term vectors won't need to be part of the
> InvIndex file format spec. :) That section was going to be a PITA,
> and by ditching it, we keep things simple and finish the spec sooner.
Excellent. KISS is good.
>> Performance (0.5-0.7s) is not bad at all, Marvin (admittedly on a small
>> subset of the full index). Excellent work!
>
> Is there a performance difference between plain search and sorted
> search? And are the invindexes optimized?
[quick repeated tests without caching, nodes have no other activity]
With sort: ~0.515s
Without: ~0.450s
All indexes optimized.
> The primary theoretical flaw in the current sorted remote search
> implementation is that there may be a lot of disk thrash for an
> unoptimized index as term numbers are converted into terms.
>
>> If this distributed search implementation is less than ideal, then
>> I would imagine there are great things to come.
>
> Here's what I have in mind:
>
> SegWriter becomes a public module, and takes on an API similar to
> that of PolyAnalyzer -- i.e. it becomes an array of writers. This
> will allow us to subclass DocWriter with e.g. DistributedDocWriter.
> (PrimaryKeyOnlyDocWriter would be another useful possibility, if
> you're combining KS with an RDBMS). That would allow us to have
> dedicated machines performing the role of fetching/highlighting.
Great idea - distribute not only searching, but other processing as well.
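
A minimal sketch of the array-of-writers idea as I read it, in Python rather
than Perl. The class names SegWriter, DocWriter and PrimaryKeyOnlyDocWriter
are borrowed from the description above, but every method name, signature and
behaviour here is an assumption for illustration, not the planned KinoSearch
API.

```python
class DocWriter:
    """Stores whole documents locally (a stand-in, not the real module)."""

    def __init__(self):
        self.docs = []

    def add_doc(self, doc_num, doc):
        self.docs.append((doc_num, dict(doc)))

    def finish(self):
        # Report how many documents this writer handled.
        return len(self.docs)


class PrimaryKeyOnlyDocWriter(DocWriter):
    """Keeps only the primary key; the full record would live in an RDBMS."""

    def __init__(self, key_field):
        super().__init__()
        self.key_field = key_field

    def add_doc(self, doc_num, doc):
        self.docs.append((doc_num, {self.key_field: doc[self.key_field]}))


class SegWriter:
    """Fans each document out to an ordered list of specialized writers."""

    def __init__(self, writers):
        self.writers = list(writers)
        self.next_doc_num = 1  # doc nums start at 1, as mentioned in the thread

    def add_doc(self, doc):
        doc_num = self.next_doc_num
        self.next_doc_num += 1
        for writer in self.writers:
            writer.add_doc(doc_num, doc)
        return doc_num

    def finish(self):
        return [writer.finish() for writer in self.writers]
```

Under these assumptions, swapping in a distributed writer would only mean
passing a different object in the list given to SegWriter; the indexing loop
itself would not change.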
> Lexicons would be handled in similar fashion, as would posting
> lists. The idea is to modularize things by task and write
> specialized modules for a distributed setup. This is how e.g. Google
> does things, and I believe it's a better model than the current
> MultiSearcher.
A nice modular distributed approach allowing more flexibility in terms of
overall (end-user) design and performance. Great for scaling up (and
sideways)...
I'm curious: this sounds like quite a bit of work. What's your thinking in
terms of schedule/timeline?
Regards
Henry