[KinoSearch] ProximityQuery
Marvin Humphrey
marvin at rectangular.com
Wed Mar 17 09:04:59 PDT 2010
On Tue, Mar 16, 2010 at 10:14:32PM -0500, Peter Karman wrote:
> > Within the existing KS code base, PhraseScorer would be the closest thing
> > to what you want. It wasn't really built to handle nearness, but maybe it
> > can be adapted.
>
> My (perhaps naive) assumption was that a PhraseScorer isa ProximityScorer
> where proximity==1.
The present implementation of PhraseScorer is not generalized for variable
proximity. It looks for exact matches...
got == wanted
... rather than checking for slop:
abs(got - wanted) < slop
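To make the difference concrete, here is a minimal sketch of the two tests as
standalone C predicates. The names exact_match and sloppy_match are
hypothetical and do not exist in the KS code base; the sketch also assumes the
strict comparison exactly as written above, under which a slop of 1 reduces to
the exact test (which lines up with the "proximity==1" framing).

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: exact phrase test. The position found must be
 * precisely the position expected. */
static int
exact_match(int32_t got, int32_t wanted) {
    return got == wanted;
}

/* Hypothetical helper: sloppy test, using the strict comparison from the
 * text above. Assumes slop >= 1; with slop == 1 only got == wanted passes. */
static int
sloppy_match(int32_t got, int32_t wanted, int32_t slop) {
    assert(slop >= 1);
    /* Widen before subtracting so extreme positions cannot overflow. */
    int64_t diff = (int64_t)got - (int64_t)wanted;
    if (diff < 0) {
        diff = -diff;
    }
    return diff < slop;
}
```

With these definitions, sloppy_match(got, wanted, 1) accepts the same inputs
as exact_match(got, wanted), while a slop of 2 also accepts positions one
token to either side of the expected one.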
I think it's possible to mod the position-matching algorithm without affecting
performance.[1] However, I'm concerned about two things.
First, the code is apparently not clear enough today for you to understand it
just by spelunking -- despite your substantial expertise, it was necessary to
ask on the list. That tells me we shouldn't be adding to it but rather
refactoring it for simplicity and clarity first.
(We don't need to worry about optimizing the matching algorithm further. It
was fast when I finished it, and then Nate went to town and streamlined it
further. So refactoring should focus on superficial organization and
comments.)
Second, everyone understands what constitutes an exact phrase match, but
there's no consensus about what constitutes a sloppy phrase match. I think
the core PhraseScorer should stay focused on canonical phrase matching rather
than branch out.
So, I think what we should do is clean up PhraseScorer so that it is clearer,
then create ProximityScorer by cloning and modding it. It's a mild violation
of DRY, but that doesn't bother me. All of us will benefit from the cleanup,
and you'll walk away with a thorough understanding of the algorithm and a
top-flight ProximityScorer.
> > Do you have an idea yet as to how you might publish this?
>
> I need to understand how the phrase matching is done currently (see above).
> If I could contribute it to the KS core, I'd be happy to. Otherwise, I
> imagine adding it to Search::Query::Dialect::KSx as another *Query type,
> joining the Wildcard features.
I think the core should be limited to canonical query types, and that
therefore ProximityQuery should be implemented as an extension. In the
interest of time and convenience, though, we should probably treat it the same
way that KSx::Search::Filter is treated today, and build it into the main
distro, just under KSx. Once we have a decent C API, we should seek to spin
it off.
Alternatively, you could write a pure-Perl implementation, but then
PhraseScorer wouldn't get a housecleaning, and it would actually be a PITA for
you to port all that C code rather than dupe it and make limited
modifications -- especially if not all the necessary information is available
at the Perl level (which it probably isn't).
Marvin Humphrey
[1] We can change SI_winnow_anchors to take a slop param, then branch at the
call site, passing either a literal 0 or a non-zero value. In the 0 case,
an optimizing compiler will have all the information it needs to build the
exact-match version. See <https://issues.apache.org/jira/browse/LUCY-99>
for an example of this technique.
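
A rough sketch of that idea, under several assumptions: winnow_anchors below is
a hypothetical stand-in, not the real SI_winnow_anchors (whose signature and
internals may differ), and winnow_exact and winnow_sloppy are invented wrapper
names. The sketch uses an inclusive comparison so that a slop of 0 means an
exact match, as the 0 case above implies. Because the helper is static inline
and winnow_exact passes a literal 0, an optimizing compiler is free to fold the
slop arithmetic away in that call.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for SI_winnow_anchors. Keeps every anchor that has
 * a candidate within `slop` positions of (anchor + offset), compacting the
 * survivors to the front of `anchors`. Returns the number kept.
 * Both arrays must be sorted in ascending order. */
static inline size_t
winnow_anchors(uint32_t *anchors, size_t num_anchors,
               const uint32_t *candidates, size_t num_candidates,
               uint32_t offset, uint32_t slop) {
    assert(anchors != NULL || num_anchors == 0);
    size_t kept = 0;
    size_t c    = 0;
    for (size_t a = 0; a < num_anchors; a++) {
        uint64_t wanted = (uint64_t)anchors[a] + offset;
        uint64_t low    = wanted > slop ? wanted - slop : 0;
        /* Skip candidates that fall below the acceptable window. Anchors
         * ascend, so these can never match a later anchor either. */
        while (c < num_candidates && candidates[c] < low) {
            c++;
        }
        if (c == num_candidates) {
            break;
        }
        if (candidates[c] <= wanted + slop) {
            anchors[kept++] = anchors[a];
        }
    }
    return kept;
}

/* Exact-match entry point: slop is the compile-time constant 0. */
size_t
winnow_exact(uint32_t *anchors, size_t num_anchors,
             const uint32_t *candidates, size_t num_candidates,
             uint32_t offset) {
    return winnow_anchors(anchors, num_anchors, candidates, num_candidates,
                          offset, 0);
}

/* Proximity entry point: slop is only known at run time. */
size_t
winnow_sloppy(uint32_t *anchors, size_t num_anchors,
              const uint32_t *candidates, size_t num_candidates,
              uint32_t offset, uint32_t slop) {
    return winnow_anchors(anchors, num_anchors, candidates, num_candidates,
                          offset, slop);
}
```

Whether the compiler actually emits a specialized exact-match body depends on
the compiler and optimization level, so it would be worth confirming with a
benchmark before relying on it.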