[KinoSearch] Wildcards (was: Re: KinoSearch feature suggestions)

Marvin Humphrey marvin at rectangular.com
Fri Jan 25 11:34:43 PST 2008




On Jan 25, 2008, at 8:38 AM, Father Chrysostomos wrote:

> I think I’m confused as to what the lexicon is for.

I've just committed some revised docs for Lexicon.  Please let me  
know if this is sufficiently clear:

     KinoSearch::Index::Lexicon - Iterator for a field's Terms.

     =head1 SYNOPSIS

         my $lexicon = $index_reader->blank_lexicon('content');
         while ( $lexicon->next ) {
            print $lexicon->get_term->get_text . "\n";
         }

     =head1 DESCRIPTION

     A Lexicon is an iterator which provides access to all the unique  
Terms
     for a given field in sorted order.

     If an index consists of two documents with a 'content' field  
holding
     "three blind mice" and "three musketeers" respectively, then the  
code
     from the synopsis above would print this:

         blind
         mice
         musketeers
         three

> In your earlier example, you used
>
> 	my $lexicon = $reader->look_up_field($field);

I had just changed the name of this method to be "blank_lexicon".

   my $lexicon = $reader->blank_lexicon($field);

However, that name choice seems to have been a bad call, judging by  
your reaction:

>  Why would we need to create a blank one?

Let's see if we can't come up with a less confusing solution.

Here's the rationale for the current names:

IndexReader provides factory methods which supply both Lexicons and  
PostingLists.

    my $seeked_lexicon        = $reader->lexicon($term);
    my $unseeked_lexicon      = $reader->blank_lexicon($field);
    my $seeked_posting_list   = $reader->posting_list($term);
    my $unseeked_posting_list = $reader->blank_posting_list($field);

blank_lexicon() and blank_posting_list() both return uninitialized  
iterators.  lexicon() and posting_list() both return iterators which  
have been pre-located via seek($term).

With the name change, I'd hoped to eliminate any ambiguity as to  
whether "look_up_field" would return a Lexicon or a PostingList, and  
also psychologically reinforce the implication of parallel behavior  
between "blank_lexicon" and "blank_posting_list".

One possible fix would be to supply only two methods, and have them  
behave polymorphically based on whether a Term or a plain string  
representing a field name was supplied:

    my $seeked_lexicon        = $reader->lexicon($term);
    my $unseeked_lexicon      = $reader->lexicon($field);
    my $seeked_posting_list   = $reader->posting_list($term);
    my $unseeked_posting_list = $reader->posting_list($field);

That would be fine at the Perl level.  However, I don't want to do  
that at the C level, where these methods are actually implemented,  
and it would be best if the C API and the Perl API resembled each  
other as closely as possible.

So here's my latest thought:

    my $seeked_lexicon        = $reader->lexicon( 'content', 'foo' );
    my $unseeked_lexicon      = $reader->lexicon( 'content', undef );
    my $seeked_posting_list   = $reader->posting_list( 'content',  
'foo' );
    my $unseeked_posting_list = $reader->posting_list( 'content',  
undef );

Does that make sense?

> Or is the idea to have a lexicon that covers multiple fields?

Lexicons explicitly do *not* cover multiple fields.  In this, they  
differ from the corresponding Lucene class, TermEnum, which always  
iterates over all terms in all fields, sorting first by field name  
then by term text.  The file formats differ as well: for KS, each  
field gets a dedicated file for simplicity's sake, while in Lucene,  
values for multiple fields coexist within the same file.

There are two reasons behind this:

   1. At some point, I intend to add support for specifying custom
      sorting routines via FieldSpec, which would affect the order
      in which the Terms in the Lexicon are sorted.  Managing multiple
      comparison functions within a single Lexicon object would be
      complex and ugly.
   2. At some point, it would be nice to support non-text fields.

Hmm.  Thinking over the second point, perhaps it would be best if  
Lexicons only stored field values rather than terms.  In Lucene, that  
wouldn't work because TermEnum objects handle multiple fields, but in  
KS, the field is fixed.

Making such a change wouldn't be trivial, but it's probably worthwhile.

> Another thing: Since 
‘pet*’ is essentially a type of simple  
> regular expression, why not provide support for Regexp queries? It  
> should be no less efficient if we look for a literal prefix  
> (completely untested):
>
>     # get the literal prefix of the regexp, if any.
>     if($self->{re} =~
>         /^
>             (?:    # prefix for qr//'s, without allowing /i :
>                 \(\? ([a-hj-z]*) (?:-[a-z]*)?:
>             )?
>             (\\[GA]|\^) # anchor
>             ([^#\$()*+.?[\]\\^]+) # literal pat (no metachars or  
> comments)
>         /x
>     ) {{
>         my ($mod,$anchor,$prefix) = ($1,$2,$3);
> 	$anchor eq '^' and $mod =~ /m/ and last;
> 	$mod =~ /x/ and $prefix =~ s/\s+//g;
>         $self->{prefix} = $prefix;
>     }}

A RegexQuery class would be nice to have, but it would have some  
significant limitations.  If it used the existing KS index data  
structures, it would not behave like a typical SQL regex or LIKE  
query, matching the regex against the non-tokenized contents for each  
field.  If you did something like this...

   my $regex_query = KSx::Search::RegexQuery->new(
     field => 'content',
     regex => qr/three blind/,
   );

... and the 'content' field was tokenized, the regex wouldn't match  
against any of the values in the Lexicon, since e.g. "blind" doesn't  
match qr/three blind/.

In contrast, a trailing wildcard is relatively easy to implement,  
because you can find all the terms that match with this routine:

   my ($wild_text) = "pet*" =~ /^(.*?)\*/;
   my $lexicon     = $index_reader->posting_list( 'content',  
$wild_text );
   while (1) {
     my $term = $lexicon->get_term;
     last unless $term;
     my $term_text = $term->get_text;
     last unless index( $term_text, $wild_text ) == 0;
     push @term_texts, $term_text;
     last unless $lexicon->next;
   }

A non-trailing wildcard is more expensive to implement, because it  
requires iterating over the entire Lexicon.

> Then a wild card query could be a subclass that does the following  
> to its input:
>
> $str = quotemeta $str;
> for($str) {
> 	s/\\\*/.*/g;
> 	s/(?:\.\*){2,}/.*/g;
> 	s/^/^/;
> 	s/\z/\\z/;
> }

Using regexes to modify regexes is probably not something I would  
have thought to do, but that's all the better.  My goal is to provide  
the scaffolding.  I'm focused on getting Lexicon and PostingList  
right, so that you can use or abuse them as you wish.  :)

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/



_______________________________________________
KinoSearch mailing list
KinoSearch at rectangular.com
http://www.rectangular.com/mailman/listinfo/kinosearch




More information about the kinosearch mailing list