[KinoSearch] more abstract interfaces to kinosearch

Marvin Humphrey marvin at rectangular.com
Thu Jun 28 23:00:21 PDT 2007


On Jun 28, 2007, at 6:54 PM, Hans Dieter Pearcey wrote:

> I can't find any way to do the OR, there, in a Query, only with  
> PolyFilter.

BooleanQuery?

> "filters should be reused because they cache stuff, while queries
> can be one-off because they don't",

In general, that's the case.

Filters are on/off: a doc either passes or it doesn't, so the result
can be cached in a BitVector at one bit per doc.

Caching the result of a Scorer (derived from a Query) would be much
more expensive, because you'd need to keep one 32-bit doc number and
one 32-bit float score around for each match.  In a worst case
scenario -- every doc matches -- that's 64 bits per doc against the
filter's 1 bit, so we're talkin' 64 times as expensive as a Filter.
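
To make that arithmetic concrete, here is a rough sketch in Python. It is not KinoSearch code: the function names and the 1,000,000-doc index are made up for illustration, and it assumes the cached filter costs exactly one bit per doc while each cached hit costs a packed 32-bit doc number plus a 32-bit float.

```python
import struct

def filter_cache_bytes(max_doc):
    """Bytes for a bit vector holding one bit per doc in the index."""
    return (max_doc + 7) // 8

def scorer_cache_bytes(num_matches):
    """Bytes for cached hits: a 32-bit doc number plus a 32-bit float each."""
    return num_matches * struct.calcsize("<If")  # 4 + 4 = 8 bytes per hit

max_doc = 1_000_000                        # hypothetical index size
bit_vector = filter_cache_bytes(max_doc)   # cached filter result
worst_case = scorer_cache_bytes(max_doc)   # every doc matches the query
ratio = worst_case / bit_vector

print(bit_vector, worst_case, ratio)
```

When only a small fraction of docs match, the gap narrows, since the bit vector always covers the whole index while the hit list grows with the number of matches.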

> Likewise, is there anything about the RangeFilter that would make it
> difficult to turn into a query?

Actually, there is.  It's hard to know how matches should contribute  
to the score.

If you issue a query which effectively says "give me matches where  
content contains 'foo' OR date is greater than 2000-01-01", how  
should docs which match only the date compare to docs which only  
match foo?  If the date is very rare, should it contribute more?

Lucene offers three solutions.

One, punt and say "use a RangeFilter".  This is the approach  
KinoSearch has taken.

Two, weight rare terms within the range more heavily than common  
terms within the range.  This is expensive and produces occasionally  
bizarre results.  It's the oldest approach, and the consensus is that  
it's not a very good one.

Three, apply a constant score, as in the Lucene class  
ConstantScoreRangeQuery.  I like this approach so long as the  
constant score is zero :) which basically makes it a filter. :)  Once  
you have non-zero scores, though, scaling the contribution that it  
should make to the score is a headache.

Another approach would be to find the number of terms that exist  
within the range, sum their doc_freqs together, and use that to  
calculate IDF and weight the score.  That's expensive, though, and  
only appropriate for esoteric situations.
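
A minimal sketch of that summed-doc_freq idea, in Python rather than KinoSearch code. The sorted term list, the doc_freqs dict, and the function name are hypothetical stand-ins for a real lexicon, and the IDF formula (1 + ln(max_doc / (doc_freq + 1))) is an assumption borrowed from Lucene's classic similarity, not something this message specifies.

```python
import bisect
import math

def range_idf(sorted_terms, doc_freqs, lower, upper, max_doc):
    """Return (summed doc_freq, IDF) for the inclusive term range [lower, upper].

    sorted_terms: the field's terms in sorted order (hypothetical lexicon).
    doc_freqs:    dict mapping each term to the number of docs containing it.
    """
    lo = bisect.bisect_left(sorted_terms, lower)
    hi = bisect.bisect_right(sorted_terms, upper)
    # Every term inside the range has to be visited, which is where the
    # expense comes from.  For a multi-valued field the sum over-counts docs
    # that contain more than one term in the range.
    summed = sum(doc_freqs[term] for term in sorted_terms[lo:hi])
    # Assumed formula: 1 + ln(max_doc / (doc_freq + 1)).
    return summed, 1.0 + math.log(max_doc / (summed + 1))

terms = ["1999-06-01", "2000-01-01", "2003-05-17", "2007-06-28"]
doc_freqs = {"1999-06-01": 40, "2000-01-01": 10,
             "2003-05-17": 30, "2007-06-28": 19}
summed, idf = range_idf(terms, doc_freqs, "2000-01-01", "9999-12-31", 100)
print(summed, idf)
```

A narrow range over rare terms gets a high weight and a wide range a low one, but the cost grows with the number of distinct terms in the range.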

I think the ultimate solution will be to make MatchFieldQuery public  
and give it a constant score which defaults to zero.   Then it could  
be combined with a RangeFilter to produce the same effect as a  
ConstantScoreRangeQuery.  MatchFieldQuery is relatively simple, and  
lets you do things that require kludges otherwise.
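
A toy sketch of that combination, in Python rather than KinoSearch code. It assumes MatchFieldQuery matches every doc that has a value for the given field (an assumption, since the message doesn't spell out its behavior). The dict-based docs and all function names are hypothetical.

```python
def match_field_hits(docs, field, constant_score=0.0):
    """Yield (doc_num, score) for every doc that has a value in `field`."""
    for doc_num, doc in enumerate(docs):
        if doc.get(field) is not None:
            yield doc_num, constant_score

def range_filter_bits(docs, field, lower, upper):
    """One on/off flag per doc: True when lower <= value <= upper."""
    return [doc.get(field) is not None and lower <= doc[field] <= upper
            for doc in docs]

def constant_score_range(docs, field, lower, upper, constant_score=0.0):
    """Match-field hits that pass the range filter, all with the same score."""
    bits = range_filter_bits(docs, field, lower, upper)
    return [(doc_num, score)
            for doc_num, score in match_field_hits(docs, field, constant_score)
            if bits[doc_num]]

docs = [{"date": "1999-06-01"}, {"date": "2003-05-17"},
        {"title": "no date"}, {"date": "2007-06-28"}]
hits = constant_score_range(docs, "date", "2000-01-01", "9999-12-31")
print(hits)
```

With the default score of zero the pair behaves like a plain filter; passing a non-zero constant_score brings back the question of how much that constant should count against other clauses.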

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/





