[KinoSearch] more abstract interfaces to kinosearch

Marvin Humphrey marvin at rectangular.com
Mon Jul 2 08:17:55 PDT 2007


On Jun 29, 2007, at 5:45 AM, Hans Dieter Pearcey wrote:

>> BooleanQuery?
>
> I don't see how I'd do this just in terms of matching.  Maybe I don't
> understand SHOULD?

If you add two clauses to a BooleanQuery with SHOULD, then their  
result sets get OR'd together.

     $bool_query->add_clause( query => $term_query_a, occur => 'SHOULD' );
     $bool_query->add_clause( query => $term_query_b, occur => 'SHOULD' );

> If some particular selection mechanism is available both as a Query
> and as a Filter -- e.g. BooleanQuery, which you can also use as part
> of a QueryFilter -- is there any reason to prefer one over the other,
> assuming that you are (as I am) only interested in matching, not
> scoring?  Do Filters have any kind of startup overhead compared to
> Queries, etc.?

If you don't care about scoring and you can reuse Filters, you should  
use as many as practical.

Scorers require hitting the disk.

QueryFilters and PolyFilters, once their internal caches are warmed,  
do not.
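
To illustrate the warm-cache point, here is a minimal sketch in plain
Perl.  It does not use KinoSearch, and it is not how QueryFilter is
implemented; score_from_disk(), filtered_docs(), and the reader ids
are hypothetical names made up for the example.  The only thing it
shows is that a cached result set costs one expensive lookup per
reader, after which repeat calls are served from memory.

```perl
use strict;
use warnings;

# Counts simulated disk hits so the effect of the cache is visible.
my $disk_hits = 0;

# Hypothetical stand-in for running a Scorer against one reader: it
# "hits the disk" and returns the matching doc numbers.
sub score_from_disk {
    my ($reader_id) = @_;
    $disk_hits++;
    return [ 1, 4, 7 ];
}

# Hypothetical filter: one cached result set per reader.
my %cache;

sub filtered_docs {
    my ($reader_id) = @_;
    $cache{$reader_id} = score_from_disk($reader_id)
        unless exists $cache{$reader_id};
    return $cache{$reader_id};
}

filtered_docs('reader_a') for 1 .. 3;    # one cold call, two warm calls
print "disk hits: $disk_hits\n";         # prints "disk hits: 1"
```

A new reader id starts with a cold cache again, which is why reuse
matters.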

The startup cost for a RangeFilter only happens once per field per  
IndexReader, when a portion of that field's lexicon is read into  
memory.  The main per-query cost is a single burst of disk activity  
to look up the search term and assign it a "term number" based on  
where it falls in the lexicon, after which everything else is CPU  
crunching and memory access.
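
The "term number" idea can be sketched in plain Perl.  This is an
illustration under stated assumptions, not the RangeFilter code: the
lexicon is a small sorted in-memory array, term_number() and
range_filter() are hypothetical helpers, and the per-document term
numbers are precomputed.  Once the two bounds have been resolved to
term numbers, matching is nothing but integer comparisons.

```perl
use strict;
use warnings;

# Hypothetical sorted lexicon for one field (assumption: terms sort
# as plain strings).
my @lexicon = qw( apple banana cherry grape melon peach );

# Binary search: return the number of lexicon entries that sort
# before $term, i.e. where $term falls in the lexicon.
sub term_number {
    my ( $lexicon, $term ) = @_;
    my ( $lo, $hi ) = ( 0, scalar @$lexicon );
    while ( $lo < $hi ) {
        my $mid = int( ( $lo + $hi ) / 2 );
        if ( $lexicon->[$mid] lt $term ) { $lo = $mid + 1 }
        else                             { $hi = $mid }
    }
    return $lo;
}

# Hypothetical per-document term numbers, indexed by doc number.
my @doc_term_nums = map { term_number( \@lexicon, $_ ) }
    qw( cherry apple peach banana melon );

# Inclusive range filter: resolve both bounds once, then compare
# integers for every document.
sub range_filter {
    my ( $lexicon, $doc_term_nums, $lower, $upper ) = @_;
    my $lo_num = term_number( $lexicon, $lower );
    my $hi_num = term_number( $lexicon, $upper );

    # If $upper is not itself in the lexicon, the last matching term
    # is the one just before its insertion point.
    $hi_num--
        unless $hi_num < @$lexicon && $lexicon->[$hi_num] eq $upper;

    return grep {
        $doc_term_nums->[$_] >= $lo_num
            && $doc_term_nums->[$_] <= $hi_num
    } 0 .. $#$doc_term_nums;
}

my @hits = range_filter( \@lexicon, \@doc_term_nums, 'banana', 'grape' );
print "@hits\n";    # prints "0 3" (the docs holding cherry and banana)
```

The bounds do not need to exist in the lexicon; a range of 'b' to 'd'
resolves to the same term numbers and returns the same two docs.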

>> I think the ultimate solution will be to make MatchFieldQuery public
>> and give it a constant score which defaults to zero.   Then it could
>> be combined with a RangeFilter to produce the same effect as a
>> ConstantScoreRangeQuery.  MatchFieldQuery is relatively simple, and
>> lets you do things that require kludges otherwise.
>
> I had found MatchFieldQuery, and thought that that might work, but
> didn't know enough internals to be sure.  I like this idea.  What can
> I do to make it work?

Sorry for the delayed response -- I had to think this over.

I've resisted making MatchFieldQuery public because I didn't feel  
like its API was mature enough.  I'm still not sure about it, and I  
don't want to add it to the list of things that have to get done  
prior to the release of 0.20.  For the time being, I suggest you go  
ahead and use MatchFieldQuery as is, but mark that aspect of your  
module experimental.  Looking forward, you can help move things along  
by participating in design discussions about subclassing strategies.

A lot of the KS public API and class design is pretty solid.  To  
touch on one aspect, I'm pleased that the Query components allow you  
to create your own query building mechanism as an alternative to  
QueryParser.  I'm also more certain than ever that the decision to  
limit QueryParser to a much simpler syntax than its Lucene  
counterpart was the right one.  What you are doing demonstrates that  
it is possible to write custom KSx extensions to play the  
Query-building role, and if someone wants to write a Lucene-ish query  
parser that supports syntax like 'boost^3', they can.  Core  
KinoSearch, by opting out of the more complex high-level task, lowers  
its support costs and maintains greater flexibility.

This is successful modularization, "divide and conquer", "loose  
coupling", etc., in action.  Every class has its own reasonably  
contained problem domain.  There are no "God Objects" that know too  
much or do too much.  The components tolerate being assembled into  
many different configurations.

The main goal of KinoSearch 0.30 will be to reproduce this  
flexibility across more phases of search and indexing.  Scorer should  
be public and it should not be so challenging to subclass.  If that  
were already the case, somebody could whip up KSx::Search::RangeQuery  
and you could use it without waiting for me to act.

For 0.20, though, it's time to think reductively (to echo a sentiment  
expressed by Nathan Kurz).  Rather than add new public APIs, it's  
time to yank Hits->seek (and simplify Searcher), migrate some  
documentation out of POD and onto the new wiki, and possibly  
redact the public APIs for Analyzer, Token, and TokenBatch, marking  
them as experimental once again so that we have the option to modify  
them.

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/




