[KinoSearch] Wildcards
Nathan Kurz
nate at verse.com
Fri Jan 25 12:11:50 PST 2008
On 1/25/08, Marvin Humphrey <marvin at rectangular.com> wrote:
> If we transform the wildcard into an array
> of TermQuery objects, then each of them has an individual IDF -- so
> in a search for "pet*", the rare term "petard" will contribute more
> than the more common term "pets". Should it? The consensus is that
> such behavior is sub-optimal.
Definitely sub-optimal, but to my mind this points out the
shortcomings of TF/IDF when used with Boolean subqueries rather than
the downside of using a Boolean query for wildcards. I hit the same
problem when using Boolean ORs to search for common spelling errors.
Does one really want a search for "speling OR spelling" to prefer
the mis-speling?
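
To make the effect concrete, here is a rough sketch rather than anything
from KinoSearch itself. The document frequencies are invented, and the IDF
formula is one common smoothed form, not necessarily the one KS uses. Term
frequency is ignored so that only the IDF effect shows.

```python
import math

def idf(doc_freq, num_docs):
    """One common smoothed IDF form (an assumption, not KinoSearch's formula)."""
    return 1.0 + math.log(num_docs / (doc_freq + 1))

# Hypothetical collection statistics, invented for illustration.
NUM_DOCS = 10_000
DOC_FREQ = {"spelling": 500, "speling": 5}

# Each OR'd term carries its own IDF, as an array of TermQuery objects would.
weights = {term: idf(df, NUM_DOCS) for term, df in DOC_FREQ.items()}

def or_score(doc_terms):
    """Score a document for "speling OR spelling" by summing the IDF of
    every query term it contains (each match counts once)."""
    return sum(w for term, w in weights.items() if term in doc_terms)

correct = or_score({"spelling", "bee"})
misspelled = or_score({"speling", "bee"})
print(f"correct={correct:.2f} misspelled={misspelled:.2f}")
```

With these made-up numbers the document containing only the misspelling
outscores the one with the correct spelling, purely because the
misspelling is rarer.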
In both of these cases, one does not want to automatically prefer the
rarer word. My guess would be that any generated query (and thus from
a practical point of view, any Boolean query) does not want this
behaviour. It's only when dealing directly with user-entered
keywords that this is a good choice.
In my opinion, one wants the parser to have access to the TF
information and to (optionally) use it when creating the query. And
one wants the IDF information to be available to the scorer for
its optional use. But the scorer should not care directly about TF,
only about the weight that has been input for each query term.
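
One possible reading of this split, sketched under loud assumptions: the
parser consults term statistics while building the query and bakes the
result into a per-term weight, and the scorer sees nothing but those
weights. The function names, the numbers, and the choice of a single
shared weight for all expansions of a wildcard are hypothetical, not
KinoSearch API.

```python
import math

# Hypothetical document frequencies, invented for illustration.
NUM_DOCS = 10_000
DOC_FREQ = {"pets": 800, "petard": 3, "petal": 40}

def expand_wildcard(prefix):
    """Parser side: expand a prefix such as "pet" and give every expansion
    the same weight. The weight comes from the summed document frequency,
    which is only an upper bound on how many documents match the wildcard."""
    terms = sorted(t for t in DOC_FREQ if t.startswith(prefix))
    combined_df = min(NUM_DOCS, sum(DOC_FREQ[t] for t in terms))
    shared = 1.0 + math.log(NUM_DOCS / (combined_df + 1))
    return {t: shared for t in terms}

def score(weighted_terms, doc_terms):
    """Scorer side: no access to collection statistics. It takes the largest
    supplied weight among the query terms the document contains."""
    return max((w for t, w in weighted_terms.items() if t in doc_terms),
               default=0.0)

query = expand_wildcard("pet")
print(score(query, {"petard"}), score(query, {"pets"}))
```

Here a document matching "petard" and one matching "pets" score the same,
because the decision about rarity was made once, by the parser, and not
per term by the scorer.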
> > Of course, I say
> > that because I'm going to want them one day for my own nefarious
> > purposes, and with flexible scoring at that.
>
> Another reason for core KS to concentrate on providing a plugin
> scaffolding on which you can hang various KSx extensions, rather than
> a smorgasbord of Query subclasses.
Agreed. I don't think you need or want a built-in WildcardQuery
class. The core should provide rock solid Boolean components, and a
means of plugging in alternate parsers and scorers.
> TF/IDF is hard to beat as a default system.
TF/IDF is an excellent means for sorting a large database of full text
news articles by relevance based on naively entered keywords. To a
reasonable approximation, web search can be viewed in this light. But
its utility in other situations varies :).
> However, I'd like to make it possible to override, not just at search time,
> but at index time.
I'm not sure I understand this. Is this in the sense of making
certain parts of the index optional, or does it go deeper than this?
> That's the rationale behind the introduction of the abstract
> base classes KinoSearch::Index::Reader and
> KinoSearch::Index::Writer. My hope is to write KSx::RTree as the
> first distro to use these capabilities.
I've been watching the commits, but haven't really had an idea of
where you are going. Could you offer an overview when you have the
time?
Nathan Kurz
nate at verse.com