[KinoSearch] Wildcards

Nathan Kurz nate at verse.com
Tue Jan 29 20:06:10 PST 2008



On 1/27/08, Marvin Humphrey <marvin at rectangular.com> wrote:
> IDF is known when compiling the Query to a Weight to a Scorer, but TF
> is per-document.  You aren't going to have access to TF at the Scorer-
> compilation stage.

Sometimes I worry that my arguments would be more persuasive if I were
able to use common terms correctly.   :)  What I meant to say was that
the global information doesn't need to be known by the Query, only by
the Scorer.   The Query would deal with only the per-document data.
This seems to be how you correctly interpreted it, despite my
mangling.
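
To make the split concrete, here is a minimal sketch in Python rather
than in KinoSearch's own code. The names `idf` and `TermScorer` are
hypothetical and are not the KinoSearch API, and the formula is just one
common TF/IDF variant, assumed for illustration. The point is only where
each number becomes available: IDF is fixed once when the scorer is
built, while TF arrives per document.

```python
import math


def idf(doc_count, doc_freq):
    """Collection-wide statistic: known before any document is visited."""
    # Assumed variant: 1 + ln(N / (df + 1)); other TF/IDF forms exist.
    return 1.0 + math.log(doc_count / (doc_freq + 1))


class TermScorer:
    """Hypothetical scorer that holds the global data for one term."""

    def __init__(self, doc_count, doc_freq):
        # Global information is captured once, at scorer-construction time.
        self.idf = idf(doc_count, doc_freq)

    def score(self, term_freq):
        # Per-document information (TF) is only supplied at scoring time.
        return math.sqrt(term_freq) * self.idf
```

A scorer built with `TermScorer(1000, 9)` computes its IDF once and can
then be handed a different `term_freq` for every matching document.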

> Or maybe the default TermQuery class can do flat scoring and
> TFIDFTermQuery would override?  I imagine that would make you happy. ;)

Given the smileys, I'm not sure if this is a joke or not.  To be
clear, this solution would make me ill.  My desire is to separate the
query from the scoring, so having a different Query class for each
possible scoring option is the antithesis of what I want.  What I want
is to have a number of independent Scorers that can be plugged into a
Scorer-agnostic set of Queries:  simple Queries, simple Scorers,
complex combinations.
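
One way to picture that arrangement, as a rough sketch under stated
assumptions: the classes below (`TermQuery`, `FlatScorer`, `TfIdfScorer`)
and the `search` helper are hypothetical Python stand-ins, not KinoSearch
classes, and the toy in-memory `INDEX` replaces a real inverted index.
The Query only reports which documents match and the raw per-document
data; any Scorer can be plugged in without a new Query class.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class TermQuery:
    """Scorer-agnostic query: yields matches, knows nothing about scoring."""
    term: str

    def matches(self, index):
        # Yield (doc_id, term_freq) for each document containing the term.
        for doc_id, tokens in index.items():
            freq = tokens.count(self.term)
            if freq:
                yield doc_id, freq


class FlatScorer:
    """Every match scores the same: pure boolean retrieval."""

    def prepare(self, query, index):
        pass  # no global statistics needed

    def score(self, freq):
        return 1.0


class TfIdfScorer:
    """Assumed TF/IDF variant: sqrt(tf) * (1 + ln(N / (df + 1)))."""

    def prepare(self, query, index):
        # Global statistics live in the scorer, not in the query.
        doc_freq = sum(1 for _ in query.matches(index))
        self.idf = 1.0 + math.log(len(index) / (doc_freq + 1))

    def score(self, freq):
        return math.sqrt(freq) * self.idf


def search(query, scorer, index):
    """Combine any query with any scorer; best score first, then doc id."""
    scorer.prepare(query, index)
    hits = [(doc_id, scorer.score(freq)) for doc_id, freq in query.matches(index)]
    return sorted(hits, key=lambda hit: (-hit[1], hit[0]))


# Toy index for illustration only: doc id -> list of tokens.
INDEX = {
    1: ["wildcard", "query", "wildcard"],
    2: ["wildcard", "scorer"],
    3: ["boolean", "query"],
}
```

The same `TermQuery("wildcard")` can be passed to `search` with either
`FlatScorer()` or `TfIdfScorer()`; only the ranking changes, not the
query.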

> TF/IDF needs to continue to be the IR model you get when you fire up
> standard KS.  But the idea of focusing on pure boolean components is
> attractive.  It would be killer if we could abstract TF/IDF to a
> higher level.

Yes, yes, exactly this.  Although I do worry that I mean a different
thing by 'this' than you. :(  But regardless of how it is abstracted,
I applaud the desire.

> R-trees are a more efficient data structure for geospatial
> searching.   However, there's no RTreeWriter writing R-tree data to
> each segment in KS by default.  I'd like to write one and make it
> easy to integrate via InvIndexer/SegWriter.

This is a beautiful concrete example.  If KinoSearch were flexible enough to
accommodate this smoothly, it seems likely it would be able to
accommodate a very wide range of other uses as well.

> In order to improve search accuracy beyond the limits of TF/IDF,
> especially when dealing with large collections, we need to be able to
> scale up both by spreading to multiple machines AND by layering
> different IR models on top of each other.  That's where KS is headed,
> and as things progress, I'm more and more confident that it's going
> to work out well.

This seems like a wonderful goal!

Nathan Kurz
nate at verse.com

_______________________________________________
KinoSearch mailing list
KinoSearch at rectangular.com
http://www.rectangular.com/mailman/listinfo/kinosearch



