[KinoSearch] Re: Wildcards
Father Chrysostomos
sprout at cpan.org
Fri Feb 15 22:25:15 PST 2008
On Feb 15, 2008, at 1:16 PM, Marvin Humphrey wrote:
>
> On Feb 14, 2008, at 8:36 PM, Father Chrysostomos wrote:
>
>>> Could you include a way for a set of terms to be treated as a
>>> single term with regard to scoring, i.e., as if ‘fool’ and
>>> ‘food’ (in a wildcard foo* match, for instance) were simply stored
>>> as ‘foo’ in the index (the way word stemming works)?
>
> The way the index is laid out, each term gets its own posting list
> with its own set of ascending document numbers. Scorers have to
> iterate through document numbers in ascending order. The only way
> to combine multiple posting lists is to interleave the doc num
> sets. The only search-time options are 1) run through each set and
> build up a superset before the Scorer starts iterating, or 2) put
> multiple PostingList objects into a priority queue sorted by
> ascending doc num.
>
> Another approach is to break all terms into all possible substrings
> at index time and store them in a separate "substrings" field. The
> size of the index will explode, but then "foo*" becomes a simple
> term query for "foo".
I don’t think this latter approach is good, because it’s too specific.
I have other uses for this besides wildcards (Greek word-stemming—not
an easy task).
>> (If I’m not making myself clear, please let me know.) If you don’t
>> want to include this in core KinoSearch, could you at least bear
>> this in mind? This would, I believe, affect the way doc_freq is
>> calculated.
>
> Yes, doc_freq is a difficult problem to solve with wildcards.
>
> It's particularly hard when you get to dealing with several indexes
> across multiple machines.
I’ve written some code that follows approach #1 above, namely, it
iterates through the posting lists one after the other, keeping a list
of doc nums that have been seen. It counts them afterwards, to get an
accurate ‘doc_freq’. Is this something you would be willing to include
in core, so I don’t have to repeat it in multiple subclasses?
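
For illustration, here is a minimal sketch of that idea in Python rather than KinoSearch's own Perl/C. Posting lists are modeled as plain sorted lists of doc nums, and every name below (POSTINGS, expand_prefix, union_doc_nums, merged_doc_nums) is hypothetical, not KinoSearch API. The first function follows approach #1 (collect the doc nums seen, then count them); the second shows approach #2 using a heap-based merge, for comparison only.

```python
import heapq
from typing import Dict, Iterable, Iterator, List

# Hypothetical toy index: term -> ascending doc nums.
# A stand-in for real posting lists, not KinoSearch's data structures.
POSTINGS: Dict[str, List[int]] = {
    "bar": [3, 4],
    "food": [1, 4, 7],
    "fool": [2, 4, 9],
    "foot": [4, 7, 10],
}


def expand_prefix(postings: Dict[str, List[int]], prefix: str) -> List[str]:
    """Return every indexed term matching prefix* (a linear scan, for brevity)."""
    return sorted(term for term in postings if term.startswith(prefix))


def union_doc_nums(posting_lists: Iterable[List[int]]) -> List[int]:
    """Approach #1: walk each list in turn, remembering the doc nums seen."""
    seen = set()
    for posting_list in posting_lists:
        seen.update(posting_list)
    # A scorer needs ascending doc nums, so sort the superset once at the end.
    return sorted(seen)


def merged_doc_nums(posting_lists: Iterable[List[int]]) -> Iterator[int]:
    """Approach #2: heap-based merge of sorted lists, skipping duplicates."""
    last = None
    for doc_num in heapq.merge(*posting_lists):
        if doc_num != last:
            yield doc_num
            last = doc_num


terms = expand_prefix(POSTINGS, "foo")
lists = [POSTINGS[term] for term in terms]
docs = union_doc_nums(lists)
# doc_freq for the combined "term": each document is counted once,
# however many of the matching terms it contains.
doc_freq = len(docs)
```

Summing the individual doc_freq values would count document 4 three times here, which is why the doc nums are deduplicated before counting.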
Father Chrysostomos
P.S.: I have a couple of other projects that are going to get in the
way for probably more than a week, so I won’t be very responsive.