[KinoSearch] Re: Wildcards
Marvin Humphrey
marvin at rectangular.com
Fri Feb 15 13:16:44 PST 2008
On Feb 14, 2008, at 8:36 PM, Father Chrysostomos wrote:
>> Could you include a way for a set of terms to be treated as a
>> single term with regard to scoring, i.e., as if ‘fool’ and
>> ‘food’ (in a wildcard foo* match, for instance) were simply stored
>> as ‘foo’ in the index (the way word stemming works)?
The way the index is laid out, each term gets its own posting list
with its own set of ascending document numbers. Scorers have to
iterate through document numbers in ascending order. The only way to
combine multiple posting lists is to interleave the doc num sets.
The only search-time options are 1) run through each set and build up
a superset before the Scorer starts iterating, or 2) put multiple
PostingList objects into a priority queue sorted by ascending doc num.
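
To make the two search-time options concrete, here is a rough sketch in Python. It is not KinoSearch code: posting lists are modeled as plain ascending lists of doc nums, and the function names `union_superset` and `union_priority_queue` are made up for illustration.

```python
import heapq

def union_superset(posting_lists):
    """Option 1: collect every doc num up front, then sort once."""
    docs = set()
    for plist in posting_lists:
        docs.update(plist)
    return sorted(docs)

def union_priority_queue(posting_lists):
    """Option 2: lazily interleave ascending lists through a priority queue."""
    heap = []
    for idx, plist in enumerate(posting_lists):
        it = iter(plist)
        first = next(it, None)
        if first is not None:
            # idx breaks ties so the iterators themselves are never compared.
            heap.append((first, idx, it))
    heapq.heapify(heap)

    last = None
    while heap:
        doc, idx, it = heap[0]
        if doc != last:          # skip doc nums already emitted by another list
            yield doc
            last = doc
        nxt = next(it, None)
        if nxt is None:
            heapq.heappop(heap)  # this posting list is exhausted
        else:
            heapq.heapreplace(heap, (nxt, idx, it))

# Hypothetical posting lists for the terms matched by foo*
food, fool, foot = [1, 4, 9], [2, 4, 7], [4, 9, 12]
print(union_superset([food, fool, foot]))
print(list(union_priority_queue([food, fool, foot])))
```

Option 1 pays the whole cost before scoring starts and holds the superset in memory. Option 2 yields doc nums in ascending order on demand, which matches how a Scorer wants to iterate.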
Another approach is to break all terms into all possible substrings at
index time and store them in a separate "substrings" field. The size
of the index will explode, but then "foo*" becomes a simple term query
for "foo".
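
The index-time approach can be sketched the same way. This is a toy in-memory model in Python, not KinoSearch's indexer; `substrings` and `build_substring_index` are hypothetical helpers, and tokenizing by whitespace is an assumption made to keep the example short.

```python
from collections import defaultdict

def substrings(term):
    """Return every distinct non-empty substring of term."""
    return {
        term[i:j]
        for i in range(len(term))
        for j in range(i + 1, len(term) + 1)
    }

def build_substring_index(docs):
    """Map each substring to the ascending doc nums whose terms contain it."""
    index = defaultdict(set)
    for doc_num, text in docs.items():
        for term in text.lower().split():
            for sub in substrings(term):
                index[sub].add(doc_num)
    return {sub: sorted(nums) for sub, nums in index.items()}

docs = {1: "fool me once", 2: "good food", 3: "bar"}
index = build_substring_index(docs)
print(index["foo"])   # a plain term lookup stands in for foo*
print(len(index))     # number of distinct "terms" in the substrings field
```

A term of length n produces up to n(n+1)/2 substrings, which is where the index growth comes from. Note also that unanchored substrings match mid-word occurrences, so a lookup for "foo" behaves like *foo*; storing only prefixes would be smaller and would match foo* exactly.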
> (If I’m not making myself clear, please let me know.) If you don’t
> want to include this in core KinoSearch, could you at least bear
> this in mind? This would, I believe, affect the way doc_freq is
> calculated.
Yes, doc_freq is a difficult problem to solve with wildcards.
It's particularly hard when you get to dealing with several indexes
across multiple machines.
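
One reason it is awkward, shown as a small Python sketch with made-up posting lists (the names here are illustrative only): if the matched terms are to count as a single term, the combined doc_freq is the size of the union of their posting lists, not the sum of the per-term doc_freq values, so it cannot be read off the stored per-term numbers.

```python
def summed_doc_freq(posting_lists):
    """Naive figure: add up each term's own doc_freq."""
    return sum(len(plist) for plist in posting_lists)

def combined_doc_freq(posting_lists):
    """Treat the terms as one: count each matching document once."""
    return len(set().union(*posting_lists))

# Hypothetical posting lists for two terms matched by foo*
food = [1, 4, 9]
fool = [2, 4, 7]

print(summed_doc_freq([food, fool]))    # doc 4 is counted twice
print(combined_doc_freq([food, fool]))  # doc 4 is counted once
```

This covers a single index only. It assumes the posting lists are already in hand, which is exactly the work a cheap doc_freq lookup is supposed to avoid.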
Marvin Humphrey
Rectangular Research
http://www.rectangular.com/