[KinoSearch] Wildcards
Marvin Humphrey
marvin at rectangular.com
Fri Feb 29 14:16:25 PST 2008
On Feb 29, 2008, at 10:11 AM, Father Chrysostomos wrote:
> If there are two documents, one containing dog and dot, and the
> other containing just dog, and the search term is do*, then the
> doc freq should be 2, since the term matches two docs. The doc freqs
> of the individual docs are 2 and 1, respectively, so if we add them
> together we get 3, and if we average them out, we get 1.5, neither
> of which is the right answer.
Hrmm, yes, you're right about that. There doesn't seem to be a good
way to get the doc_freq for a wildcard at search-time without
iterating through the posting lists. If we really want to know that
number, I think the only way is to generate it at index-time. We're
back to the "index all substrings" approach -- which you weren't all
that enthused about.
Are you going to be able to do what you need to do using the public
API for PostingList now exposed?
> Thank you for pointing this out. Ive just realised that the
> WildCardQuery implementation Im working on iterates through them
> twice. Ill optimise it later.
OK. Please keep in mind that iterating through posting lists for
common terms is where the bulk of the cost lies when searching large
indexes.
>> It would be overkill to add a large, complex CompositePostingList
>> class to KS right now, in order to avoid short-term code duplication.
>
> Fair enough.
I should add a couple comments... To scale up well, a
CompositePostingList class would need to be implemented in C. Same
thing for a WildCardAnalyzer. Neither of them really needs to be in
core -- they'll use C APIs that are planned to be exposed. But we
have to get to the point where they actually are exposed, so I'm going
to continue focusing on developing the C API.
Marvin Humphrey
Rectangular Research
http://www.rectangular.com/
More information about the kinosearch
mailing list