[KinoSearch] Wildcards

Marvin Humphrey marvin at rectangular.com
Fri Jan 25 02:28:15 PST 2008




On Jan 23, 2008, at 11:47 PM, Nathan Kurz wrote:

> Don't punt on the scoring!

Well, here's the problem, which afflicts the current implementation  
of wildcards in Lucene.  If we transform the wildcard into an array  
of TermQuery objects, then each of them has an individual IDF -- so  
in a search for "pet*", the rare term "petard" will contribute more  
than the more common term "pets".  Should it?  The consensus is that  
such behavior is sub-optimal.
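
To make the imbalance concrete, here is a minimal Python sketch. It is
not KinoSearch or Lucene code: the idf() formula is just one common
TF/IDF variant, and the index size and document frequencies are
made-up numbers chosen for illustration.

```python
import math

def idf(doc_freq, num_docs):
    """One common inverse-document-frequency variant (an assumption here)."""
    return 1.0 + math.log(num_docs / (doc_freq + 1))

# Hypothetical index size and per-term document frequencies.
NUM_DOCS = 100_000
doc_freqs = {"pets": 5_000, "petard": 10}

# Each expanded term gets its own weight, as separate TermQuery objects would.
weights = {term: idf(df, NUM_DOCS) for term, df in doc_freqs.items()}

for term, weight in sorted(weights.items()):
    print(f"{term:8s} df={doc_freqs[term]:6d} idf={weight:.2f}")
```

With these numbers "petard" ends up weighted well over twice as heavily
as "pets", even though the user typed only "pet*" and expressed no
preference between the two.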

> From my naive point of view, a wildcard just looks like another way of
> specifying a boolean OR.  Why not expand it out at the parser level?
>  Sure it might be really big, but there's nothing wrong with providing
> support for industrial strength boolean queries.

However any particular WildcardQuery gets implemented, it will need  
some sort of safety valve to prevent "a*" from swamping the server.
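
As a rough illustration of what such a safety valve might look like,
here is a Python sketch that expands a prefix against a sorted term
list and gives up once a cap is exceeded.  Everything in it is
hypothetical: expand_prefix(), TooManyTerms, and the default limit of
1024 are invented for this example and are not part of KinoSearch.

```python
from bisect import bisect_left

class TooManyTerms(Exception):
    """Raised when a wildcard would expand to more terms than allowed."""

def expand_prefix(sorted_terms, prefix, max_terms=1024):
    """Return the terms starting with `prefix`, or raise TooManyTerms."""
    matches = []
    # Terms sharing a prefix are contiguous in a sorted lexicon,
    # so jump to the first candidate and scan forward from there.
    start = bisect_left(sorted_terms, prefix)
    for term in sorted_terms[start:]:
        if not term.startswith(prefix):
            break
        if len(matches) >= max_terms:
            raise TooManyTerms(f"{prefix}* matches more than {max_terms} terms")
        matches.append(term)
    return matches

# Hypothetical lexicon standing in for the index's term dictionary.
lexicon = sorted(["apple", "pet", "petard", "petal", "pets", "pew"])
print(expand_prefix(lexicon, "pet"))
```

Raising an error is only one possible policy; truncating the expansion
or keeping only the most frequent terms would be alternatives, each
with its own effect on recall.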

> Of course, I say
> that because I'm going to want them one day for my own nefarious
> purposes, and with flexible scoring at that.

Another reason for core KS to concentrate on providing a plugin  
scaffolding on which you can hang various KSx extensions, rather than  
a smorgasbord of Query subclasses.

>> Actually, if we iterate up front, we could find out the IDF of the
>> fragment and then use that to assess a crude score.
>
> I will be so appreciative some day if you move away from architectures
> that presume IDF is always going to be the way that things are
> scored.

TF/IDF is hard to beat as a default system.  However, I'd like to  
make it possible to override, not just at search time, but at index  
time.  That's the rationale behind the introduction of the abstract  
base classes KinoSearch::Index::Reader and  
KinoSearch::Index::Writer.  My hope is to write KSx::RTree as the  
first distro to use these capabilities.

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/


