[KinoSearch] more abstract interfaces to kinosearch

Nathan Kurz nate at verse.com
Mon Jul 2 18:30:53 PDT 2007


On 7/2/07, Hans Dieter Pearcey <hdp at pobox.com> wrote:
> A basic overview of how the different components of searching work together
> would make it much easier for me to understand each individual piece.

Hi Hans ---

I've been feeling much the same way, although I think the problem
isn't really with KinoSearch but with the Lucene architecture that it
is based on:  it's hairy and convoluted.  On the bright side, it's
pretty well documented.  I found this page after I'd already spelunked
the KinoSearch code:
<http://lucene.apache.org/java/docs/scoring.html>.  Apart from some
small differences in class names, it describes KinoSearch as well.
And aside from the formatting problems, the notes I made while trying
to figure out the code might help as well:
http://www.rectangular.com/pipermail/kinosearch/2007-June/001005.html

> I wouldn't mind trying to organize anything I learn
> into part of the manual, though documentation isn't my strong point.
> Docs::DevGuide seems like a reasonable place for an "overview of KinoSearch
> and how the classes fit together" section.

Not to discourage you from writing documentation, but I'm personally
of the opinion that documenting the current state is less important
than simplifying the way it works.  The Lucene code is well
documented, but still (to my way of thinking) almost impenetrable.

With you trying to do scoreless matching, me trying to implement
straight proximity scoring, and Marvin on top of the current scoring
method, it seems like we ought to be able to come up with a generic
system that can be easily extended to cover most needs.

> I don't think I want KinoSearch to do more than it's designed to.  In
> some ways, the difficulty I've had grasping the details has been more
> frustrating specifically because it is clearly very modular; if it were a big
> chunk of unmalleable code, I'd give up and move on.

Maybe you could send a message describing more exactly what you are
trying to accomplish, with less emphasis on how you are trying to do
it?  My impression from reading your exchange with Marvin is that what
you want (unlike what I want) is indeed possible without any changes
to the C code or architecture, though not with optimal efficiency.

The 'SHOULD' clauses of a BooleanQuery are handled by the code in
ORScorer.c, which implements a common-English (as opposed to
logical/short-circuit) OR.  All the clauses are checked, even if an
earlier clause matched.  It might help to think of it as a "Some" or
"Sum" query:  only documents that match at least one of the clauses
are passed on, and the sum of the matching subqueries' scores is used
as the total score.
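
As a rough illustration of the "Sum" reading, here is a toy model in
Python.  It is not KinoSearch's C code, and the function name and the
use of None for "clause did not match" are assumptions made up for the
sketch:

```python
# Toy model only: hypothetical names, not KinoSearch's API.
def or_score(subscores):
    """Combine the per-clause scores for a single document.

    subscores has one entry per SHOULD clause: a float if that clause
    matched the document, or None if it did not.
    """
    # Every clause is consulted; there is no short-circuit on first match.
    matched = [s for s in subscores if s is not None]
    if not matched:
        return None  # no clause matched, so the document is not passed on
    # The total is the sum of the scores of the clauses that matched.
    return sum(matched)
```

A document matching two clauses with scores 1.5 and 0.5 gets the sum
of both, while a document matching no clause is dropped entirely.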

To get this to work for your purposes, you either need to change or
subclass ORScorer so that it returns a constant score, or get its
subqueries to return a score of 0 so that it doesn't matter how many
of them match.  Changing ORScorer would require working at the C
level, but I think you should be able to get the subscores to zero by
setting 'boost' to zero somewhere.
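
To see why a zero boost would flatten the result, here is a second toy
model in Python.  It assumes that boost acts as a simple multiplier on
each subquery's raw score; that is a simplification for illustration,
and the function name is hypothetical rather than anything in
KinoSearch:

```python
# Toy model only: assumes boost is a plain multiplier on each raw score.
def boosted_or_score(raw_scores, boost):
    """Sum the boosted scores of the clauses that matched one document.

    raw_scores has one entry per clause: a float for a match, None
    for a miss.  Returns None when no clause matched.
    """
    matched = [raw * boost for raw in raw_scores if raw is not None]
    if not matched:
        return None  # still filtered out: matching is unaffected by boost
    return sum(matched)
```

With boost set to 0, a document matching one clause and a document
matching three clauses both end up with the same score, while
documents matching nothing are still excluded.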

This is a somewhat inefficient hack (as it calculates the full score
before multiplying it by 0), but from what I can tell about your goals
it should work.  Alternatively, if you are doing away with scoring
altogether, you might be able to use a SortCollector instead of a
TopDocCollector.  I haven't played with it, but I think it completely
ignores the returned score and lets you specify your own sorting
function.
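
The difference between the two kinds of collector can be sketched with
a toy model in Python.  This is only a guess at the general idea: the
function names and the (doc_id, score) hit format are invented for the
example and do not reflect the real SortCollector or TopDocCollector
interfaces:

```python
import heapq

# Toy model only: hypothetical functions, not KinoSearch's collectors.
def top_docs_by_score(hits, n):
    """Keep the n highest-scoring documents from (doc_id, score) pairs."""
    best = heapq.nlargest(n, hits, key=lambda hit: hit[1])
    return [doc_id for doc_id, _score in best]

def top_docs_by_key(hits, n, sort_key):
    """Ignore the scores and order the documents by a caller-supplied key."""
    doc_ids = [doc_id for doc_id, _score in hits]
    return sorted(doc_ids, key=sort_key)[:n]
```

In the second function the score never influences the ordering, so it
would not matter whether the scorers returned real scores or zeros.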

Hope this helps,

Nathan Kurz
nate at verse.com


