[KinoSearch] Wildcards (was Re: KinoSearch feature suggestions)
Marvin Humphrey
marvin at rectangular.com
Wed Jan 23 22:31:36 PST 2008
On Jan 23, 2008, at 6:59 AM, Peter Karman wrote:
> Yes, Swish-e supports the '*' to match 1 or more characters at the
> end of the word, and
> '?' to match exactly one character anywhere in the word. However,
> Swish-e does that by
> means of a 256-byte wide lookup table (iirc), which works only
> because Swish-e supports
> single-byte encodings.
Interesting. Is the lookup table used when matching terms, or also
postings? Is there a clear division between the two stages, as there
is in KS?
Accumulating the set of terms that match the wildcard isn't that
hard. Here's a simple example with a trailing wildcard.
my $wildcard_query_string = "pet*";
my ($frag) = $wildcard_query_string =~ /(.*?)\*/;
my $lexicon = $index_reader->look_up_field("content");
$lexicon->seek($frag);
my @terms;
while ( $lexicon->next ) {
my $term = $lexicon->get_term;
last unless index( $term->get_text, $frag ) == 0;
push @terms, $term;
}
From there, we assemble a priority queue of PostingLists:
my $pri_q = KinoSearch::Search::PListQueue->new( size => scalar
@terms );
for my $term (@terms) {
my $posting_list = $index_reader->posting_list($term);
$pri_q->insert($posting_list);
}
The problem we have now is that the priority queue of PostingLists
probably isn't a good way to zip through a lot of matching terms.
There's going to be some disk seeking, as the results for "peter" and
"petroleum" and "petunia" are interleaved. Hmm...
If we punt on scoring, it might make sense from an i/o standpoint to
iterate through all the matches up front and save a BitVector with
matching doc nums set.
Actually, if we iterate up front, we could find out the IDF of the
fragment and then use that to assess a crude score. However, we
wouldn't have TF info available unless we use something bigger than
a BitVector to hold the temporary results.
Marvin Humphrey
Rectangular Research
http://www.rectangular.com/
_______________________________________________
KinoSearch mailing list
KinoSearch at rectangular.com
http://www.rectangular.com/mailman/listinfo/kinosearch
More information about the kinosearch
mailing list