[KinoSearch] Wildcards (was Re: KinoSearch feature suggestions)

Marvin Humphrey marvin at rectangular.com
Wed Jan 23 22:31:36 PST 2008



On Jan 23, 2008, at 6:59 AM, Peter Karman wrote:
> Yes, Swish-e supports the '*' to match 1 or more characters at the
> end of the word, and '?' to match exactly one character anywhere in
> the word. However, Swish-e does that by means of a 256-byte wide
> lookup table (iirc), which works only because Swish-e supports
> single-byte encodings.

Interesting.  Is the lookup table used when matching terms, or also  
postings?  Is there a clear division between the two stages, as there  
is in KS?

Accumulating the set of terms that match the wildcard isn't that  
hard.  Here's a simple example with a trailing wildcard.

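   my $wildcard_query_string = "pet*";
   # Capture everything before the first '*' as the literal prefix.
   my ($frag) = $wildcard_query_string =~ /(.*?)\*/;
   my $lexicon = $index_reader->look_up_field("content");
   # Position the lexicon at the prefix, then walk forward in term order.
   $lexicon->seek($frag);
   my @terms;
   while ( $lexicon->next ) {
     my $term = $lexicon->get_term;
     # Terms are sorted, so the first non-match ends the run of matches.
     last unless index( $term->get_text, $frag ) == 0;
     push @terms, $term;
   }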
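
For anyone who wants to try the idea without an index handy, here is a
rough, self-contained sketch of the same prefix scan.  It assumes the
lexicon can be modeled as a plain sorted array of term texts; the names
@sorted_terms and prefix_terms are made up for illustration and are not
part of KinoSearch.  The binary search stands in for the seek.

```perl
use strict;
use warnings;

# Hypothetical stand-in for a field's lexicon: term texts in sorted order.
my @sorted_terms = qw( pest pet peter petroleum petunia pew );

# Return every term that starts with $frag.
sub prefix_terms {
    my ( $terms, $frag ) = @_;

    # Binary search for the first term that sorts at or after $frag.
    # This plays the role of the seek on the lexicon.
    my ( $lo, $hi ) = ( 0, scalar @$terms );
    while ( $lo < $hi ) {
        my $mid = int( ( $lo + $hi ) / 2 );
        if ( $terms->[$mid] lt $frag ) {
            $lo = $mid + 1;
        }
        else {
            $hi = $mid;
        }
    }

    # Walk forward until a term no longer begins with the fragment.
    my @matches;
    for my $i ( $lo .. $#$terms ) {
        last unless index( $terms->[$i], $frag ) == 0;
        push @matches, $terms->[$i];
    }
    return @matches;
}

my @matches = prefix_terms( \@sorted_terms, "pet" );
print join( ", ", @matches ), "\n";
```

Because the terms are sorted, all matches for a trailing wildcard sit
next to each other, which is why the loop can bail out at the first
miss.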

From there, we assemble a priority queue of PostingLists:

   my $pri_q = KinoSearch::Search::PListQueue->new( size => scalar @terms );
   for my $term (@terms) {
     my $posting_list = $index_reader->posting_list($term);
     $pri_q->insert($posting_list);
   }

The problem we have now is that the priority queue of PostingLists  
probably isn't a good way to zip through a lot of matching terms.   
There's going to be some disk seeking, as the results for "peter" and  
"petroleum" and "petunia" are interleaved.  Hmm...

If we punt on scoring, it might make sense from an i/o standpoint to  
iterate through all the matches up front and save a BitVector with  
matching doc nums set.
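
A rough sketch of that up-front pass, using Perl's built-in vec() on a
string as a stand-in for a BitVector.  The posting lists are invented
arrays of doc numbers, and collect_matches and set_bits are illustrative
names, not KinoSearch API.  Each list is consumed start to finish before
moving on to the next, so the reads are sequential per term.

```perl
use strict;
use warnings;

# Hypothetical posting lists, one array of doc numbers per matching term.
my @posting_lists = ( [ 1, 4, 9 ], [ 2, 4, 7 ], [ 3, 9 ] );

# Exhaust each list in turn, setting one bit per matching doc number.
sub collect_matches {
    my (@lists) = @_;
    my $bits = '';
    for my $list (@lists) {
        vec( $bits, $_, 1 ) = 1 for @$list;
    }
    return $bits;
}

# List the doc numbers whose bits are set, in ascending order.
sub set_bits {
    my ($bits) = @_;
    return grep { vec( $bits, $_, 1 ) } 0 .. 8 * length($bits) - 1;
}

my $bits = collect_matches(@posting_lists);
print join( " ", set_bits($bits) ), "\n";
```

A doc that matches more than one term just has its bit set twice, so
the result says which docs matched but not how strongly.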

Actually, if we iterate up front, we could count the matching docs,
derive an IDF for the fragment as a whole, and use that to assign a
crude score.  However, we wouldn't have TF info available unless we
use something bigger than a BitVector to hold the temporary results.
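
Here is one way the "something bigger" might look: a hash from doc
number to summed term frequency, with the fragment's doc frequency
falling out as the number of keys.  Everything here is an assumption
made for illustration: the data, the name score_fragment, and the
weighting, which borrows the sqrt(tf) and 1 + ln(N / (df + 1)) shapes
from Lucene's classic similarity rather than anything KinoSearch is
known to do for wildcards.

```perl
use strict;
use warnings;

# Hypothetical postings: term => { doc number => term frequency }.
my %postings = (
    peter     => { 1 => 2, 4 => 1 },
    petroleum => { 4 => 3, 7 => 1 },
);
my $total_docs = 10;    # assumed size of the collection

# Treat all matching terms as one pseudo-term and score each doc.
sub score_fragment {
    my ( $postings, $total_docs ) = @_;

    # Sum TF per doc across every term that matched the fragment.
    my %tf;
    for my $term ( keys %$postings ) {
        my $docs = $postings->{$term};
        $tf{$_} += $docs->{$_} for keys %$docs;
    }

    # Doc frequency of the fragment is the number of distinct docs.
    my $doc_freq = keys %tf;
    return ( {}, 0 ) unless $doc_freq;

    # Assumed weighting: Lucene-style IDF and square-root TF damping.
    my $idf = 1 + log( $total_docs / ( $doc_freq + 1 ) );
    my %score;
    $score{$_} = sqrt( $tf{$_} ) * $idf for keys %tf;
    return ( \%score, $idf );
}

my ( $score, $idf ) = score_fragment( \%postings, $total_docs );
printf "doc %d: %.3f\n", $_, $score->{$_} for sort { $a <=> $b } keys %$score;
```

The hash costs a good deal more memory than one bit per doc, which is
the tradeoff being weighed above.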

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/


