[KinoSearch] more abstract interfaces to kinosearch
Marvin Humphrey
marvin at rectangular.com
Tue Jul 3 01:09:57 PDT 2007
On Jul 2, 2007, at 2:16 PM, Nathan Kurz wrote:
> The hierarchy I'm working with is much flatter and more
> straightforward: a reusable Query produces an index-specific Scorer
> that returns a HitCollector:
> no Searcher,
Hmm. I don't see the advantage in that.
> no Weight objects,
Here is the rationale behind Weight, from the original commit message
in January 2003.
    a. Queries are no longer modified during a search. This makes
       it possible, e.g., to reuse the same query instance with
       multiple indexes from multiple threads.
I can see doing away with Weight. Then our queries would look more
like Plucene's.
Note that, without Weight, you have to modify the Query itself. If you try to wait
until creating the Scorer to perform the weighting, you run into
problems with MultiSearcher, specifically with the calculation of
IDF. The MultiSearcher object knows how common a term is across all
the sub-indexes *combined*. That's information that each individual
IndexReader doesn't have access to.
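
To make the IDF problem concrete, here is a minimal sketch in Python. It is not KinoSearch code: the idf() helper uses the classic Lucene-style formula as an assumption, and the per-index statistics are made-up numbers chosen only for illustration.

```python
import math

def idf(doc_freq, num_docs):
    # Classic Lucene-style inverse document frequency (assumed here).
    return 1.0 + math.log(num_docs / (doc_freq + 1))

# Hypothetical statistics for one term in two sub-indexes:
# (doc_freq, num_docs). The term is rare in one, common in the other.
sub_indexes = [(1, 1000), (400, 1000)]

# Each sub-index, working alone, arrives at a different weight.
local_idfs = [idf(df, n) for df, n in sub_indexes]

# A MultiSearcher-like aggregator sums the statistics first, so every
# sub-index can score with the same, corpus-wide weight.
total_df = sum(df for df, _ in sub_indexes)
total_docs = sum(n for _, n in sub_indexes)
global_idf = idf(total_df, total_docs)
```

If each Scorer computed its own weight, scores from the two sub-indexes would not be comparable; the combined value has to be computed up front and handed down.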
> no Similarity object,
For large corpora, optimal results are usually only obtained if you
use different length_norm functions for short fields like "title" and
longer fields like "content". See the explanation in the docs for
LongFieldSim:
http://www.rectangular.com/kinosearch/docs/devel/KSx/Search/LongFieldSim.html
I know that length normalization doesn't matter for your application,
but it's key for standard tf/idf.
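
A rough sketch of the idea, in Python rather than Perl. The first function is the usual 1/sqrt(length) normalization. The second is only an assumption about what a long-field variant might look like: the function name and the floor of 100 tokens are illustrative and not taken from the LongFieldSim source.

```python
import math

def length_norm(num_tokens):
    # Classic tf/idf length normalization: shorter fields score higher.
    return 1.0 / math.sqrt(num_tokens)

def long_field_length_norm(num_tokens, floor=100):
    # Hypothetical variant for long fields: anything shorter than
    # `floor` tokens is treated as if it were `floor` tokens long, so
    # a very short "content" field gets no outsized boost.
    return 1.0 / math.sqrt(max(num_tokens, floor))
```

With the plain function, a 4-token field gets a norm of 0.5, which is reasonable for "title" but lets tiny documents dominate a "content" field. The floored variant gives every field under 100 tokens the same norm.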
I think Doug Cutting nailed this aspect of the design. Concentrating
as much IR theory as possible in one class was absolutely the right
move, IMO. This stuff would be harder to understand, test,
experiment with, or improve if it were spread out over the whole library.
> no twisty mazes,
:)
> no delayed inits,
Any of these that can go away, I say good riddance. They are usually
artifacts of a simplified external API, though. The delayed init
spares the caller from the responsibility of invoking some init
routine manually before a loop begins.
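
For illustration, the delayed-init pattern looks roughly like this. It is a sketch in Python with hypothetical names (LazyScorer, init_calls), not the actual KinoSearch scorer code.

```python
class LazyScorer:
    """Toy scorer that defers its setup until the first call to next()."""

    def __init__(self, doc_ids):
        self._doc_ids = doc_ids
        self._iter = None      # not initialized yet
        self.init_calls = 0    # counter kept only for demonstration

    def _init(self):
        # Setup work the caller would otherwise have to trigger by hand.
        self.init_calls += 1
        self._iter = iter(sorted(self._doc_ids))

    def next(self):
        # Delayed init: the first call pays for the setup.
        if self._iter is None:
            self._init()
        return next(self._iter, None)
```

The caller just loops on next() until it returns None. The price is an extra branch on every call and an object that is half-built until first use.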
> and no Hits->seek.
I'd love to see a patch for this one. It would be an excellent start.
> It's probably not quite generic enough for general use,
> but I think it's going to be possible to move it in that direction
> once I get it working. It's certainly simpler to comprehend, and I
> think it might end up more efficient as well.
What optimizations have you found?
> I started along the path of trying to subclass the existing Scoring
> classes to make them work the way I wanted, but it wasn't working
> well. Line by line the existing code is great to work with, but the
> overall hierarchy hardcodes TF/IDF scoring at a fairly deep level.
> While it's been slow, I've been much happier trying to factor this out
> rather than overriding it class by class. And by redoing it, I'm
> getting a much clearer understanding about how the current setup
> works.
>
> I've got tons of questions, but I'll limit myself to two for now:
>
> 1) Is there a good reason to keep BooleanScorer at the C level, rather
> than moving it up into Perl?
BooleanScorer's internal compilation phase could be handled in Perl.
There aren't any performance considerations.
A thought: what if we did away with BooleanQuery, replacing it with
ANDQuery, ORQuery, etc? I dunno that we want to open that can o'
worms when 0.20 is so close, though.
> 2) The current ORScorer calls Tally on its subscorers at the same time
> it is skipping through documents, rather than at the end of the phase.
> Is this a good practice that I should emulate? My instinct is that
> it would be inefficient for certain types of queries:
> ((expensive-phrase OR expensive-phrase) AND rare-filter).
That's true. However, it's quite efficient for this:
((expensive-phrase OR expensive-phrase) AND rare-term).
That's because all the subscorers get to call Skip_To for docs that
match 'rare-term' and don't bother with building phrase matches for
docs that can't be hits.
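
A toy version of that interaction, sketched in Python. ListScorer and and_matches are hypothetical names, a single list stands in for the whole OR sub-scorer, and the `examined` counter stands in for the cost of building a phrase match. None of this is the real C implementation.

```python
import bisect

class ListScorer:
    """Scorer over a sorted doc-id list; counts the docs it examines."""

    def __init__(self, doc_ids):
        self.doc_ids = sorted(doc_ids)
        self.pos = 0
        self.examined = 0  # stands in for expensive phrase matching

    def skip_to(self, target):
        # Jump straight to the first doc id >= target, never looking
        # at the docs in between.
        self.pos = bisect.bisect_left(self.doc_ids, target, self.pos)
        if self.pos >= len(self.doc_ids):
            return None
        self.examined += 1
        return self.doc_ids[self.pos]


def and_matches(rare, expensive):
    """Drive the conjunction from the rare scorer's doc ids."""
    hits = []
    for doc_id in rare.doc_ids:
        if expensive.skip_to(doc_id) == doc_id:
            hits.append(doc_id)
    return hits


rare = ListScorer([501, 900])
expensive = ListScorer(range(0, 1000, 2))  # 500 candidate docs
hits = and_matches(rare, expensive)
```

Here the expensive side examines two docs out of 500, because the rare term dictates where it has to look.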
> ps. I like the direction of KinoSearch::Simple, particularly the
> integration of the indexing and searching. I'm tempted to think that
> rather than calling it 'Simple', you should just call it 'KinoSearch'
> and eventually have it be the main API.
I disagree. Searching and indexing are completely distinct tasks.
http://en.wikipedia.org/wiki/God_object
It's fine if someone like Hans creates a convenience layer on top of
KS that violates separation of concerns, but the primary classes
should not be designed that way.
PS: Family's coming to visit, so this'll be my last big missive for a
while.
Marvin Humphrey
Rectangular Research
http://www.rectangular.com/