[KinoSearch] Wildcards (was: Re: KinoSearch feature suggestions)

Father Chrysostomos sprout at cpan.org
Fri Jan 25 08:38:38 PST 2008




On Jan 25, 2008, at 2:26 AM, Marvin Humphrey wrote:

>
> On Jan 24, 2008, at 8:20 PM, Father Chrysostomos wrote:
>
>> I’m trying to have a go at this.
>>
>> How many times is the disk accessed when one does a boolean search  
>> (e.g., 'this OR that OR the-other')? And what are those times?
>
> The stack is pretty deep.  The Perl side looks something like...
>
>   KinoSearch::Search::Searchable::search
>   KinoSearch::Searcher::top_docs
>   KinoSearch::Searcher::collect
>
I was wondering whether it would be just as efficient to create a  
BooleanQuery as Mr. Kurz suggested, but I see the problem with the IDFs.

>> I could find the answer myself by reading more source code, but  
>> it’s awfully time consuming....
>
> In order to create legitimate subclasses to implement WildCard  
> queries, a bunch of stuff that isn't yet public will have to become  
> public.  I'm starting that off by exposing the Lexicon class, along  
> with the factory method $index_reader->blank_lexicon($field_name).

I think I’m confused as to what the lexicon is for. In your earlier  
example, you used

	my $lexicon = $reader->look_up_field($field);

so it appears that $lexicon is a pointer (not in the C sense) into the  
index for the list of terms in that field. Why would we need to create  
a blank one? Or is the idea to have a lexicon that covers multiple  
fields?


Another thing: Since ‘pet*’ is essentially a type of simple regular
expression, why not provide support for Regexp queries? It should be
no less efficient if we look for a literal prefix (completely untested):

     # get the literal prefix of the regexp, if any.
     if($self->{re} =~
         /^
             (?:    # wrapper added by qr, without allowing the i flag:
                 \(\? ([a-hj-z]*) (?:-[a-z]*)?:
             )?
             (\\[GA]|\^) # anchor
             ([^#\$()*+.?[\]\\^]+) # literal pat (no metachars or comments)
         /x
     ) {{
         my ($mod,$anchor,$prefix) = (defined $1 ? $1 : '',$2,$3);
         $anchor eq '^' and $mod =~ /m/ and last;
         $mod =~ /x/ and $prefix =~ s/\s+//g;
         $self->{prefix} = $prefix;
     }}
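
To make the idea easier to try out, here is a standalone sketch of the
same prefix extraction as a plain function. The name literal_prefix is
hypothetical and nothing here is KinoSearch API. Beyond the snippet
above it makes a few assumptions of its own: it also accepts the
(?^flags:...) form that newer perls use when stringifying qr objects, it
gives up on any pattern containing '|', and it drops the last literal
character when a quantifier that allows zero repetitions follows it.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of KinoSearch: return the literal prefix
# that every match of $re must begin with, or undef if none is safe.
sub literal_prefix {
    my ($re) = @_;
    my $src = "$re";                 # qr objects stringify with a (?flags: wrapper
    return undef if $src =~ /\|/;    # conservative: give up on any alternation
    return undef unless $src =~
        /^
            (?: \(\? \^? ([a-hj-z]*) (?:-[a-z]*)? : )?  # qr wrapper, no i flag
            (\\[GA]|\^)                                 # anchor
            ([^#\$()*+.?[\]\\^|{}]+)                    # run of literal characters
            (.?)                                        # whatever follows the run
        /x;
    my ($mod, $anchor, $prefix, $next) = (defined $1 ? $1 : '', $2, $3, $4);
    return undef if $anchor eq '^' and $mod =~ /m/;  # ^ is not start-only under m
    $prefix =~ s/\s+//g if $mod =~ /x/;              # whitespace is ignored under x
    chop $prefix if $next =~ /^[*?{]/;               # last char may match zero times
    return length($prefix) ? $prefix : undef;
}

print literal_prefix(qr/^pet/), "\n";    # pet
print literal_prefix('^pet*'), "\n";     # pe
```

Returning undef means "no usable prefix", in which case a query would
presumably have to scan every term in the field.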

Then a wild card query could be a subclass that does the following to  
its input:

$str = quotemeta $str;
for($str) {
	s/\\\*/.*/g;
	s/(?:\.\*){2,}/.*/g;
	s/^/^/;
	s/\z/\\z/;
}
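
As a quick check of that conversion, here is a self-contained sketch
that wraps the same substitutions in a function and runs the result
against a small term list. The function name wildcard_to_regexp and the
terms are made up for illustration; a real query would walk the index's
term list rather than a Perl array.

```perl
use strict;
use warnings;

# Hypothetical helper: turn a wildcard string such as 'pet*' into an
# anchored regexp source, following the substitutions shown above.
sub wildcard_to_regexp {
    my ($str) = @_;
    $str = quotemeta $str;          # escape everything, including '*'
    $str =~ s/\\\*/.*/g;            # an escaped '*' becomes '.*'
    $str =~ s/(?:\.\*){2,}/.*/g;    # collapse runs such as '**'
    return "^$str\\z";              # anchor both ends
}

my @terms   = qw(pest pet petal pets petty pew);   # made-up field terms
my $source  = wildcard_to_regexp('pet*');
my @matches = grep { /$source/ } @terms;
print "$source => @matches\n";      # ^pet.*\z => pet petal pets petty
```

Because the output always starts with '^' followed by the escaped
literal text, a pattern like this is exactly the kind from which a
literal prefix ('pet' here) can be recovered.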



_______________________________________________
KinoSearch mailing list
KinoSearch at rectangular.com
http://www.rectangular.com/mailman/listinfo/kinosearch



