[KinoSearch] Possible Phrase Query Bug
Marvin Humphrey
marvin at rectangular.com
Sun Sep 9 15:46:07 PDT 2007
On Sep 8, 2007, at 11:28 AM, Nathan Kurz wrote:
> On 9/7/07, Matthew O'Connor <matthew.oconnor at socialtext.com> wrote:
>> I think I've found a bug with phrase queries.
Matthew, excellent job producing a repeatable test scenario.
> The problem
> was an unsigned integer underflow if the first occurrence of a term
> was at a position less than its phrase offset.
Nathan, excellent job identifying and fixing the bug.
> I've fixed it by changing to using from a u32_t to an i32_t and
> casting the compare, which seemed like the least invasive change.
I appreciate the light touch. :)
What I actually committed was slightly different, and attempts to
explain the algorithm more effectively:
/* Discard positions that occur too early in the field
to match as
* a part of the phrase. For example, if the field
begins with
* "The ants go marching one by one", that initial "The"
cannot
* match as the second term in a phrase search for
* "fight the power".
*/
target = phrase_offset;
while (candidates < candidates_end && *candidates <
target) {
candidates++;
}
if (candidates == candidates_end)
break;
> It does slightly reduce the range of legal positions to 31-bits, but I
> don't think this is going to catch anyone.
It's already limited to that by code elsewhere, and such a situation
is pretty esoteric in any case. With ordinary analyzers, you'd need
to have a field value that was longer than 2 GB for that to come into
play at all, and even then that assumes each byte is a token holding
down a position.
Where people might theoretically have gotten tripped up is via
accidentally running out of room after $token->set_pos_inc
($large_number) during indexing. I don't think limiting the range of
usable positions to 31 bits impedes usability, but *checking* the
range, as this bug illustrates, is a good idea. I've added one such
check with revision 2526.
> I didn't look at the .15
> version, but presume the problem can be fixed in the same simple
> manner.
Yes. The algorithm hadn't changed.
I committed the changes to trunk as 2524 and to maint as 2525. Thanks!
> There are probably more invasive alternatives if the range reduction
> is unacceptable. Also, it looks possible to speed up the loop a bit,
> but probably at the cost of readability. I'll hold off on trying this
> unless it's thought to be a bottleneck.
I'd be surprised if it were. This code is already significantly
different from the Lucene PhraseScorer code and theoretically faster.
In Lucene, each position is iterated over with a method call to
termPositions.nextPosition(), and objects representing each term in
the phrase are kept in a priority queue. In KS, we load all the
positions at once, then traverse the arrays with pointer math, using
an "anchor set" algorithm not in Lucene.
The Lucene technique uses less memory. The KS technique has less
function-call overhead. I'm not certain whether that was the right
trade to make historically, but I am nearly certain that there's no
way to do what we want to do with expanded position scoring without
keeping all positions loaded.
Marvin Humphrey
Rectangular Research
http://www.rectangular.com/
_______________________________________________
KinoSearch mailing list
KinoSearch at rectangular.com
http://www.rectangular.com/mailman/listinfo/kinosearch
More information about the kinosearch
mailing list