[KinoSearch] Wildcards
Nathan Kurz
nate at verse.com
Tue Jan 29 20:06:10 PST 2008
On 1/27/08, Marvin Humphrey <marvin at rectangular.com> wrote:
> IDF is known when compiling the Query to a Weight to a Scorer, but TF
> is per-document. You aren't going to have access to TF at the Scorer-
> compilation stage.
Sometimes I worry that my arguments would be more persuasive if I were
able to use common terms correctly. :) What I meant to say was that
the global information doesn't need to be known by the Query, only by
the Scorer. The Query would deal with only the per-document data.
This seems to be how you correctly interpreted it, despite my
mangling.
> Or maybe the default TermQuery class can do flat scoring and
> TFIDFTermQuery would override? I imagine that would make you happy. ;)
Given the smileys, I'm not sure if this is a joke or not. To be
clear, this solution would make me ill. My desire is to separate the
query from the scoring, so having a different Query class for each
possible scoring option is the antithesis of what I want. What I want
is to have a number of independent Scorers that can be plugged into a
Scorer-agnostic set of Queries: simple Queries, simple Scorers,
complex combinations.
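
To make the separation concrete, here is a rough Python sketch of what
"Scorer-agnostic Queries plus pluggable Scorers" might look like. All
of the names (TermQuery.matches, FlatScorer, TfIdfScorer) are
hypothetical and are not KinoSearch's actual API, and the TF/IDF
formula is just one common variant chosen for illustration. The Query
only reports per-document data (which documents match, and the term
frequency in each); the Scorer alone deals with collection-wide
statistics such as IDF.

```python
import math
from collections import Counter

# Hypothetical names throughout; this is not KinoSearch's real API.


class TermQuery:
    """Knows only how to match: yields (doc_id, term_frequency) pairs."""

    def __init__(self, term):
        self.term = term

    def matches(self, docs):
        for doc_id, tokens in enumerate(docs):
            tf = Counter(tokens)[self.term]
            if tf:
                yield doc_id, tf


class FlatScorer:
    """Pure boolean scoring: every matching document scores 1.0."""

    def score(self, query, docs):
        return {doc_id: 1.0 for doc_id, _ in query.matches(docs)}


class TfIdfScorer:
    """Combines per-document TF from the Query with collection-wide IDF."""

    def score(self, query, docs):
        hits = list(query.matches(docs))
        if not hits:
            return {}
        # IDF is a global statistic, so it is computed here, not in the Query.
        idf = 1.0 + math.log(len(docs) / len(hits))
        return {doc_id: math.sqrt(tf) * idf for doc_id, tf in hits}


# A toy "index": each document is a list of tokens.
docs = [
    ["wildcard", "query", "wildcard"],
    ["scorer", "query"],
    ["rtree"],
]

query = TermQuery("wildcard")
flat = FlatScorer().score(query, docs)     # same Query, flat scoring
tfidf = TfIdfScorer().score(query, docs)   # same Query, TF/IDF scoring
```

The same TermQuery object is handed to either Scorer unchanged, so
adding a new scoring model means writing one new Scorer rather than
one new subclass per Query type.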
> TF/IDF needs to continue to be the IR model you get when you fire up
> standard KS. But the idea of focusing on pure boolean components is
> attractive. It would be killer if we could abstract TF/IDF to a
> higher level.
Yes, yes, exactly this. Although I do worry that I mean a different
thing by 'this' than you do. :( But regardless of how it is abstracted,
I applaud the desire.
> R-trees are a more efficient data structure for geospatial
> searching. However, there's no RTreeWriter writing R-tree data to
> each segment in KS by default. I'd like to write one and make it
> easy to integrate via InvIndexer/SegWriter.
This is a beautiful concrete example. If KinoSearch were flexible enough to
accommodate this smoothly, it seems likely it would be able to
accommodate a very wide range of other uses as well.
> In order to improve search accuracy beyond the limits of TF/IDF,
> especially when dealing with large collections, we need to be able to
> scale up both by spreading to multiple machines AND by layering
> different IR models on top of each other. That's where KS is headed,
> and as things progress, I'm more and more confident that it's going
> to work out well.
This seems like a wonderful goal!
Nathan Kurz
nate at verse.com
_______________________________________________
KinoSearch mailing list
KinoSearch at rectangular.com
http://www.rectangular.com/mailman/listinfo/kinosearch