[KinoSearch] more abstract interfaces to kinosearch
Marvin Humphrey
marvin at rectangular.com
Thu Jun 28 23:00:21 PDT 2007
On Jun 28, 2007, at 6:54 PM, Hans Dieter Pearcey wrote:
> I can't find any way to do the OR, there, in a Query, only with
> PolyFilter.
BooleanQuery?
> "filters should be reused because they cache stuff, while queries
> can be one-off because they don't",
In general, that's the case.
Filters are on-off: a doc either passes or it doesn't, so the result
can be cached in a BitVector at one bit per doc.
Caching the result of a Scorer (derived from a Query) would be much
more expensive, because you'd need to keep one 32-bit doc number and
one 32-bit float score around for each match. In a worst case
scenario -- every doc matches -- we're talkin' 64 bits per doc instead
of 1, or 64 times as expensive as a Filter.
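
To put rough numbers on that comparison, here is a back-of-the-envelope
sketch in Python. It is not KinoSearch code: the function names and the
1,000,000-doc index size are made up for illustration, and it assumes one
bit per doc for the filter cache and a 4-byte doc number plus a 4-byte
float score per hit for the scorer cache.

```python
def filter_cache_bytes(max_doc):
    """One bit per doc in the index, rounded up to whole bytes."""
    return (max_doc + 7) // 8


def scorer_cache_bytes(num_hits):
    """A 32-bit doc number plus a 32-bit float score per matching doc."""
    return num_hits * (4 + 4)


max_doc = 1_000_000  # hypothetical index size

# Worst case: every doc in the index matches the query.
worst_case_ratio = scorer_cache_bytes(max_doc) / filter_cache_bytes(max_doc)
print(filter_cache_bytes(max_doc), scorer_cache_bytes(max_doc), worst_case_ratio)
```

Under these assumptions the scorer cache only comes out smaller when fewer
than roughly 1 doc in 64 matches, since the bit vector costs the same no
matter how many docs pass.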
> Likewise, is there anything about the RangeFilter that would make it
> difficult to turn into a query?
Actually, there is. It's hard to know how matches should contribute
to the score.
If you issue a query which effectively says "give me matches where
content contains 'foo' OR date is greater than 2000-01-01", how
should docs which match only the date compare to docs which only
match foo? If the date is very rare, should it contribute more?
Lucene offers three solutions.
One, punt and say "use a RangeFilter". This is the approach
KinoSearch has taken.
Two, weight rare terms within the range more heavily than common
terms within the range. This is expensive and produces occasionally
bizarre results. It's the oldest approach, and the consensus is that
it's not a very good one.
Three, apply a constant score, as in the Lucene class
ConstantScoreRangeQuery. I like this approach so long as the
constant score is zero :) which basically makes it a filter. :) Once
you have non-zero scores, though, scaling the contribution that it
should make to the score is a headache.
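
A toy illustration of why the non-zero case is awkward, sketched in Python
rather than with any real Lucene or KinoSearch API. The `or_scores`
function, the doc numbers, and the scores are all invented for the example;
it simply ORs a scored term clause with a range clause that adds a fixed
constant to every doc it matches.

```python
def or_scores(term_scores, range_hits, constant_score=0.0):
    """OR a scored term clause with a constant-score range clause.

    term_scores: dict of doc_id -> score from the term clause
    range_hits:  set of doc_ids that fall inside the range
    Returns (doc_id, score) pairs, best score first, ties broken by doc_id.
    """
    combined = dict(term_scores)
    for doc_id in range_hits:
        combined[doc_id] = combined.get(doc_id, 0.0) + constant_score
    return sorted(combined.items(), key=lambda kv: (-kv[1], kv[0]))


foo_scores = {1: 0.75, 2: 0.25}  # made-up scores for content:foo
date_hits = {2, 3}               # made-up docs with date > 2000-01-01

with_zero = or_scores(foo_scores, date_hits, 0.0)
with_one = or_scores(foo_scores, date_hits, 1.0)
print(with_zero)
print(with_one)
```

With a constant of zero the range only widens the set of matches and the
ordering among foo matches is untouched. With a constant of 1.0, doc 3
(date only) outranks doc 1 (foo only), and whether that is right depends
entirely on how the constant was picked relative to the term scores.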
Another approach would be to find the number of terms that exist
within the range, sum their doc_freqs together, and use that to
calculate IDF and weight the score. That's expensive, though, and
only appropriate for esoteric situations.
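
For the curious, the summed-doc_freq idea might look something like the
Python sketch below. The `range_idf` name and its inputs are hypothetical,
and the IDF formula, 1 + ln(max_doc / (doc_freq + 1)), is the one used by
classic Lucene similarity; plugging it in here is an assumption, not
something KinoSearch does for ranges.

```python
import math


def range_idf(doc_freqs_in_range, max_doc):
    """IDF for a whole range, treating the summed doc_freqs as one doc_freq.

    doc_freqs_in_range: doc_freq of every term that falls inside the range
    max_doc:            number of docs in the index
    """
    summed = sum(doc_freqs_in_range)
    # A doc holding two in-range terms gets counted twice, so the sum can
    # overshoot the real number of matching docs; cap it at max_doc.
    summed = min(summed, max_doc)
    return 1.0 + math.log(max_doc / (summed + 1))


# Hypothetical index of 1000 docs: a narrow range versus a broad one.
narrow = range_idf([1, 2], 1000)
broad = range_idf([100, 200, 300], 1000)
print(narrow, broad)
```

The expense comes from building `doc_freqs_in_range` in the first place:
every term between the range endpoints has to be visited before a single
doc can be scored.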
I think the ultimate solution will be to make MatchFieldQuery public
and give it a constant score which defaults to zero. Then it could
be combined with a RangeFilter to produce the same effect as a
ConstantScoreRangeQuery. MatchFieldQuery is relatively simple, and
lets you do things that require kludges otherwise.
Marvin Humphrey
Rectangular Research
http://www.rectangular.com/