[KinoSearch] Re: Wildcards

Father Chrysostomos sprout at cpan.org
Fri Feb 15 22:25:15 PST 2008


On Feb 15, 2008, at 1:16 PM, Marvin Humphrey wrote:

>
> On Feb 14, 2008, at 8:36 PM, Father Chrysostomos wrote:
>
>>> Could you include a way for a set of terms to be treated as a  
>>> single term with regard to scoring, i.e., as if ‘fool’ and  
>>> ‘food’ (in a wildcard foo* match, for instance) were simply stored  
>>> as ‘foo’ in the index (the way word stemming works)?
>
> The way the index is laid out, each term gets its own posting list  
> with its own set of ascending document numbers.  Scorers have to  
> iterate through document numbers in ascending order.  The only way  
> to combine multiple posting lists is to interleave the doc num  
> sets.   The only search-time options are 1) run through each set and  
> build up a superset before the Scorer starts iterating, or 2) put  
> multiple PostingList objects into a priority queue sorted by  
> ascending doc num.
>
> Another approach is to break all terms into all possible substrings  
> at index time and store them in a separate "substrings" field.  The  
> size of the index will explode, but then "foo*" becomes a simple  
> term query for "foo".

I don’t think this latter approach is good, because it’s too specific.  
I have other uses for this besides wildcards (Greek word-stemming—not  
an easy task).
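
For comparison, here is a minimal sketch of search-time option 2 from the quoted text: several posting lists kept in a priority queue ordered by ascending doc num, so that a scorer sees each document once. It is written in plain Python rather than against KinoSearch, and the name `interleave` and the use of plain lists of integers as posting lists are assumptions made only for illustration.

```python
import heapq

def interleave(posting_lists):
    """Yield each doc num once, in ascending order.

    Each element of posting_lists is assumed to be an ascending
    sequence of integer doc nums (a stand-in for a real posting list).
    """
    iters = [iter(p) for p in posting_lists]
    heap = []
    # Seed the queue with the first doc num of every non-empty list.
    for idx, it in enumerate(iters):
        first = next(it, None)
        if first is not None:
            heap.append((first, idx))
    heapq.heapify(heap)

    last = None
    while heap:
        doc_num, idx = heapq.heappop(heap)
        # Skip a doc num already emitted from another list.
        if doc_num != last:
            yield doc_num
            last = doc_num
        # Refill the queue from the list that was just consumed.
        nxt = next(iters[idx], None)
        if nxt is not None:
            heapq.heappush(heap, (nxt, idx))
```

Nothing here is specific to wildcards: the same merge would apply to any set of terms that should behave as one, such as the forms of a stemmed word.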

>> (If I’m not making myself clear, please let me know.) If you don’t  
>> want to include this in core KinoSearch, could you at least bear  
>> this in mind? This would, I believe, affect the way doc_freq is  
>> calculated.
>
> Yes, doc_freq is a difficult problem to solve with wildcards.
>
> It's particularly hard when you get to dealing with several indexes  
> across multiple machines.

I’ve written some code that follows approach #1 above, namely, it  
iterates through the posting lists one after the other, keeping a list  
of doc nums that have been seen. It counts them afterwards, to get an  
accurate ‘doc_freq’. Is this something you would be willing to include  
in core, so I don’t have to repeat it in multiple subclasses?
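
The idea described above can be sketched roughly as follows. This is a plain Python illustration, not the code referred to in the message and not KinoSearch's API; the function name `merged_doc_freq` and the representation of posting lists as lists of integers are assumptions.

```python
def merged_doc_freq(posting_lists):
    """Treat several terms as one for counting purposes.

    Walks the posting lists one after the other, remembers every doc num
    seen, and returns the combined ascending doc nums together with the
    number of distinct documents (the combined doc_freq).
    """
    seen = set()
    for postings in posting_lists:
        seen.update(postings)
    doc_nums = sorted(seen)
    return doc_nums, len(doc_nums)

# Hypothetical posting lists for 'fool' and 'food' under a foo* query.
fool = [1, 4, 7]
food = [2, 4, 9]
doc_nums, doc_freq = merged_doc_freq([fool, food])
```

Document 4 contains both terms but is counted once, which is why the combined doc_freq differs from the sum of the per-term values.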


Father Chrysostomos

P.S.: I have a couple of other projects that are going to get in the  
way for probably more than a week, so I won’t be very responsive.



