[KinoSearch] more abstract interfaces to kinosearch

Marvin Humphrey marvin at rectangular.com
Mon Jul 2 13:35:45 PDT 2007


On Jul 2, 2007, at 8:38 AM, Hans Dieter Pearcey wrote:

>> If you add two clauses to a BooleanQuery with SHOULD, then their
>> result sets get OR'd together.
>>
>>     $bool_query->add_clause( query => $term_query_a, occur => 'SHOULD' );
>>     $bool_query->add_clause( query => $term_query_b, occur => 'SHOULD' );
>
> Is this true even when (like me) you are only interested in matching?

In theory, the (unfinished) MatchPosting class is supposed to help out
with situations like yours.  However, because it doesn't store token
positions, it can't support phrase matching, so it may need to be
rethought.
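
To illustrate why positions matter for phrase matching, here is a toy
sketch in plain Perl.  The %positions hash, its data, and both subs are
made up for illustration; none of this is KinoSearch's API or file
format.

```perl
use strict;
use warnings;

# Hypothetical toy postings: term => { doc_num => [ token positions ] }.
my %positions = (
    new  => { 1 => [ 0, 7 ], 2 => [3] },
    york => { 1 => [1],      2 => [9] },
);

# Docs containing every term.  Only doc numbers are consulted here, so a
# posting format without positions could answer this.
sub match_all_terms {
    my ( $first, @rest ) = @_;
    my @docs = sort { $a <=> $b } keys %{ $positions{$first} || {} };
    for my $term (@rest) {
        @docs = grep { exists $positions{$term}{$_} } @docs;
    }
    return @docs;
}

# Docs where the terms occur at consecutive positions.  This cannot be
# answered without stored positions.
sub match_phrase {
    my @terms = @_;
    my @hits;
    DOC: for my $doc ( match_all_terms(@terms) ) {
        for my $start ( @{ $positions{ $terms[0] }{$doc} } ) {
            my $ok = 1;
            for my $i ( 1 .. $#terms ) {
                my %at = map { $_ => 1 } @{ $positions{ $terms[$i] }{$doc} };
                if ( !$at{ $start + $i } ) { $ok = 0; last; }
            }
            if ($ok) { push @hits, $doc; next DOC; }
        }
    }
    return @hits;
}

print "all terms: @{[ match_all_terms(qw(new york)) ]}\n";
print "phrase:    @{[ match_phrase(qw(new york)) ]}\n";
```

In this toy data both docs contain both terms, but only doc 1 has them
adjacent, so the phrase check needs the position lists that a
match-only posting would leave out.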

> Also, is there some reason that this isn't documented?

No, there's no reason.  I guess I thought the capabilities were  
implied by the class name.  Looks like usability testing has revealed  
a flaw! ;)

>  a Query that did something like "match all documents"

That would be the as-yet-non-existent MatchAllDocsQuery, which would  
have an interface similar to MatchFieldQuery.  The difference between  
the two would be analogous to the difference between 'SELECT doc_num'  
and 'SELECT doc_num WHERE foo IS NOT NULL'.
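
A toy sketch of that analogy in plain Perl.  The %docs hash and the two
subs are hypothetical stand-ins, not the interface of MatchFieldQuery
or of the not-yet-written MatchAllDocsQuery.

```perl
use strict;
use warnings;

# Hypothetical toy "index": doc_num => { field => value }.
my %docs = (
    0 => { title => 'alpha', foo => 'x' },
    1 => { title => 'beta' },
    2 => { title => 'gamma', foo => 'y' },
);

# Analogue of 'SELECT doc_num': every document matches.
sub match_all_docs {
    my @nums = sort { $a <=> $b } keys %docs;
    return @nums;
}

# Analogue of 'SELECT doc_num WHERE foo IS NOT NULL': only documents
# that have a value for the named field match.
sub match_field {
    my $field = shift;
    my @nums = grep { defined $docs{$_}{$field} }
        sort { $a <=> $b } keys %docs;
    return @nums;
}

print "all docs:    @{[ match_all_docs() ]}\n";
print "has a 'foo': @{[ match_field('foo') ]}\n";
```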

>> If you don't care about scoring and you can reuse Filters, you should
>> use as many as practical.
>
> What if I can't reuse Filters, but I don't care about scoring?

If you can't reuse QueryFilters or PolyFilters, they offer no  
advantage.  They're probably mildly less efficient than just adding a  
clause to the query.
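
The reasoning can be sketched in plain Perl, assuming the benefit of a
reusable filter comes from computing its set of matching docs once and
caching it.  make_cached_filter, %docs and $scans are made up for
illustration; this is not how QueryFilter or PolyFilter are actually
implemented.

```perl
use strict;
use warnings;

# Hypothetical toy data: doc_num => language.
my %docs  = ( 0 => 'en', 1 => 'fr', 2 => 'en', 3 => 'de' );
my $scans = 0;    # how many times the full pass over the docs ran

# Returns a closure that builds its set of allowed docs on first use
# and hands back the cached set on every later use.
sub make_cached_filter {
    my $lang = shift;
    my $cache;
    return sub {
        if ( !$cache ) {
            $scans++;
            $cache = { map { $_ => 1 } grep { $docs{$_} eq $lang } keys %docs };
        }
        return $cache;
    };
}

my $english = make_cached_filter('en');

# Three searches narrowed by the same filter object: the scan runs once.
for my $hits ( [ 0, 1, 2 ], [ 2, 3 ], [ 1, 3 ] ) {
    my $allowed = $english->();
    my @kept    = grep { $allowed->{$_} } @$hits;
    print "kept: @kept\n";
}
print "scans after three reuses: $scans\n";
```

A filter that is built, used for one search and thrown away still pays
for its scan, so in this model it saves nothing over an extra clause.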

>> This is successful modularization, "divide and conquer", "loose
>> coupling", etc, in action.  Every class has its own reasonably
>> contained problem domain.  There are no "God Objects" that know too
>> much or do too much.  The components tolerate being assembled into
>> many different configurations.
>
> I agree that core KS has taken the right direction here.
>
> The one place where this seems less true is the distinction between
> scoring and matching, as I noted previously.

Lucene has a family of Query subclasses called SpanQueries, which I  
have not ported and don't intend to ever put in KinoSearch's core.   
What I'd like to do is make it possible for someone to write a  
KSx::Spans distro.  It might even include a KSx::Spans::QueryParser  
subclass that uses SpanTermQuery in place of TermQuery and so on.

Analogously, it should be possible to create a suite of
Queries/Scorers which are optimized for matching alone.  I believe that the  
changes to KinoSearch's file format in 0.20 and the introduction of  
Posting should facilitate this, but the OO infrastructure needs more  
work.

In the meantime, the current KS query classes don't exactly suck if  
you just need matching. :)

> My guess (because I don't know anything about IR theory, or whatever)
> is that you assumed that of COURSE people wouldn't want just matching
> and not scoring,

It's true that returning results ranked by relevancy is something I  
put a high priority on, but I've definitely thought about other  
cases.  It's just that unstructured search is a more pressing  
problem.  There are a lot of good databases out there.  KS shouldn't  
aspire to compete with PostgreSQL.

> Of course, I'm approaching it from a different direction, so I have
> different assumptions; I want to treat KS more like a traditional
> database, which means I have different expectations, 'unique'
> constraints, stuff like that.

KinoSearch is always going to be optimized for the use case of a  
large number of queries against a single view of an index.

I don't think we'll have to make a choice between matching alone and  
matching with scoring, though.  It should be possible to support both  
without compromise.

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/




