[KinoSearch] Wildcards

Marvin Humphrey marvin at rectangular.com
Fri Feb 29 14:16:25 PST 2008


On Feb 29, 2008, at 10:11 AM, Father Chrysostomos wrote:

> If there are two documents, one containing ‘dog’ and ‘dot,’ and the  
> other containing just ‘dog’, and the search term is ‘do*’, then the  
> doc freq should be 2, since the term matches two docs. The doc freqs  
> of the individual docs are 2 and 1, respectively, so if we add them  
> together we get 3, and if we average them out, we get 1.5, neither  
> of which is the right answer.

Hrmm, yes, you're right about that.  There doesn't seem to be a good  
way to get the doc_freq for a wildcard at search-time without  
iterating through the posting lists.  If we really want to know that  
number, I think the only way is to generate it at index-time.  We're  
back to the "index all substrings" approach -- which you weren't all  
that enthused about.

Are you going to be able to do what you need to do using the public  
API for PostingList now exposed?

> Thank you for pointing this out. I’ve just realised that the  
> WildCardQuery implementation I’m working on iterates through them  
> twice. I’ll optimise it later.


OK.  Please keep in mind that iterating through posting lists for  
common terms is where the bulk of the cost lies when searching large  
indexes.

>> It would be overkill to add a large, complex CompositePostingList  
>> class to KS right now, in order to avoid short-term code duplication.
>
> Fair enough.


I should add a couple comments... To scale up well, a  
CompositePostingList class would need to be implemented in C.  Same  
thing for a WildCardAnalyzer.  Neither of them really needs to be in  
core -- they'll use C APIs that are planned to be exposed.  But we  
have to get to the point where they actually are exposed, so I'm going  
to continue focusing on developing the C API.

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/




More information about the kinosearch mailing list