[KinoSearch] ProximityQuery
Peter Karman
peter at peknet.com
Sun Mar 21 18:50:10 PDT 2010
Marvin Humphrey wrote on 3/21/10 3:07 PM:
> On Sun, Mar 21, 2010 at 02:01:41AM -0500, Peter Karman wrote:
>> Marvin, please have a look when you have a chance, and let me know what needs
>> changing.
>
> The current implementation has a limitation I think is probably pretty
> important: 'b NEAR a' doesn't return the same result set as 'a NEAR b'.
>
As you noted earlier in this thread, there is no consensus about what a
proximity query is. :)
I did consider the fact that proximity might imply that order does not matter.
But I came down here: if I want order to matter, and the ProximityScorer ignores
order as you're suggesting, then I have no options. I can't limit my search to
'a NEAR b'.
If instead we leave the ProximityScorer as is, then this:
(a NEAR b) OR (b NEAR a)
does what you're describing.
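To make the distinction concrete, here is a minimal sketch in Python rather than the KinoSearch C code. The function names and the representation of a document as lists of token positions are assumptions for illustration only. It treats 'a NEAR b' as "some 'b' occurs after some 'a', no more than `within` positions later", so that within == 1 is an exact phrase match.

```python
def ordered_near(pos_a, pos_b, within):
    """Ordered test: True if some position in pos_b follows some
    position in pos_a by at least 1 and at most `within` positions."""
    return any(0 < pb - pa <= within for pa in pos_a for pb in pos_b)


def unordered_near(pos_a, pos_b, within):
    """Unordered test, built as (a NEAR b) OR (b NEAR a)."""
    return (ordered_near(pos_a, pos_b, within)
            or ordered_near(pos_b, pos_a, within))


# Hypothetical document in which 'b' (position 5) precedes 'a' (position 7).
positions = {"a": [7], "b": [5]}

a_near_b = ordered_near(positions["a"], positions["b"], 3)    # order matters
b_near_a = ordered_near(positions["b"], positions["a"], 3)
either = unordered_near(positions["a"], positions["b"], 3)    # order ignored
```

With an order-sensitive scorer both behaviours are available: the ordered test alone, or the OR of the two ordered tests. With an order-insensitive scorer only the second would be.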
Consider too:
(a NEAR b NEAR c)
which might be written as:
"a b c"~10
What order should I consider there? 'a' within 10 positions of 'b' and 'c'? or
'b' within 10 positions of 'a' and 'c'? or... You see how the possibilities
multiply.
I think simpler is better here: if you want order to not matter, then OR
together the various orders you might be interested in. In fact, I may offer
that as an option in the Search::Query::Parser, which could then do the ORing
programmatically. Likewise, if we choose to support the "a b"~N syntax in the KS
QueryParser, we could do something similar.
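The programmatic ORing described above could look roughly like the following Python sketch. It is not code from Search::Query::Parser or the KS QueryParser; `any_order` is a hypothetical helper, and the output is just a query string in the NEAR notation used in this message.

```python
from itertools import permutations


def any_order(terms):
    """Return a query string that ORs together every ordering of the
    given terms, each ordering joined into one NEAR chain."""
    clauses = ["(" + " NEAR ".join(order) + ")"
               for order in permutations(terms)]
    return " OR ".join(clauses)


two_terms = any_order(["a", "b"])
three_terms = any_order(["a", "b", "c"])
```

Two terms expand to two clauses and three terms to six, so the number of clauses grows factorially with the number of terms. That is a reason to make this expansion an opt-in parser feature rather than the scorer's default behaviour.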
I note that one of the Lucene classes you mentioned earlier[0] makes inOrder an
option. The Lucene PhraseScorer's slop feature, however, does seem to respect
order with no option otherwise.
[0]
http://svn.apache.org/repos/asf/lucene/java/trunk/src/java/org/apache/lucene/search/spans/SpanNearQuery.java
>
> Superficial stylistic suggestion: I might propose "proximity" (or "nearness",
> but "proximity" is better) instead of "near" for the name of that parameter.
> Or alternately, "slop", but I understand why you went with nearness instead.
I like 'proximity' for consistency's sake. And yes, 'near' is not quite right.
How about 'within'? Or 'vicinity'?
>
>> In the end it was a one-line difference in the SI_winnow_anchors implementation
>> to get the near/slop feature working. I left the original implementation intact
>> and put a switch/case wrapper around it to leave the optimization (if any)
>> intact for phrases (near==1).
>
> This doesn't technically need changing, but to cut down on the duplicated
> code, the switch on self->near should theoretically happen here:
Ah yes, that's much better.
--
Peter Karman . http://peknet.com/ . peter at peknet.com