[KinoSearch] Possible Phrase Query Bug

Marvin Humphrey marvin at rectangular.com
Sun Sep 9 15:46:07 PDT 2007




On Sep 8, 2007, at 11:28 AM, Nathan Kurz wrote:

> On 9/7/07, Matthew O'Connor <matthew.oconnor at socialtext.com> wrote:
>> I think I've found a bug with phrase queries.

Matthew, excellent job producing a repeatable test scenario.

> The problem
> was an unsigned integer underflow if the first occurrence of a term
> was at a position less than its phrase offset.

Nathan, excellent job identifying and fixing the bug.

> I've fixed it by changing to using from a u32_t to an i32_t and
> casting the compare, which seemed like the least invasive change.

I appreciate the light touch. :)

What I actually committed was slightly different, and attempts to  
explain the algorithm more effectively:

             /* Discard positions that occur too early in the field  
to match as
              * a part of the phrase.  For example, if the field  
begins with
              * "The ants go marching one by one", that initial "The"  
cannot
              * match as the second term in a phrase search for
              * "fight the power".
              */
             target = phrase_offset;
             while (candidates < candidates_end && *candidates <  
target) {
                 candidates++;
             }
             if (candidates == candidates_end)
                 break;

> It does slightly reduce the range of legal positions to 31-bits, but I
> don't think this is going to catch anyone.

It's already limited to that by code elsewhere, and such a situation  
is pretty esoteric in any case.  With ordinary analyzers, you'd need  
to have a field value that was longer than 2 GB for that to come into  
play at all, and even then that assumes each byte is a token holding  
down a position.

Where people might theoretically have gotten tripped up is via  
accidentally running out of room after $token->set_pos_inc 
($large_number) during indexing.  I don't think limiting the range of  
usable positions to 31 bits impedes usability, but *checking* the  
range, as this bug illustrates, is a good idea.  I've added one such  
check with revision 2526.

> I didn't look at the .15
> version, but presume the problem can be fixed in the same simple
> manner.

Yes.  The algorithm hadn't changed.

I committed the changes to trunk as 2524 and to maint as 2525.  Thanks!

> There are probably more invasive alternatives if the range reduction
> is unacceptable.  Also, it looks possible to speed up the loop a bit,
> but probably at the cost of readability.  I'll hold off on trying this
> unless it's thought to be a bottleneck.

I'd be surprised if it were.  This code is already significantly  
different from the Lucene PhraseScorer code and theoretically faster.

In Lucene, each position is iterated over with a method call to  
termPositions.nextPosition(), and objects representing each term in  
the phrase are kept in a priority queue.  In KS, we load all the  
positions at once, then traverse the arrays with pointer math, using  
an "anchor set" algorithm not in Lucene.

The Lucene technique uses less memory.  The KS technique has less  
function-call overhead.  I'm not certain whether that was the right  
trade to make historically, but I am nearly certain that there's no  
way to do what we want to do with expanded position scoring without  
keeping all positions loaded.

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/



_______________________________________________
KinoSearch mailing list
KinoSearch at rectangular.com
http://www.rectangular.com/mailman/listinfo/kinosearch




More information about the kinosearch mailing list