[KinoSearch] Wildcards

Marvin Humphrey marvin at rectangular.com
Sun Jan 27 20:20:00 PST 2008




On Jan 25, 2008, at 12:57 PM, Father Chrysostomos wrote:

>>  2. At some point, it would be nice to support non-text fields.
>
> Since binary data can be stored in a string, it is already  
> supported, is it not?

No, that's not the case.  In the devel branch, non-binary fields are  
forced to UTF-8 very early on, in InvIndexer::add_doc.

>> Hmm.  Thinking over the second point, perhaps it would be best if  
>> Lexicons only stored field values rather than terms.  In Lucene,  
>> that wouldn't work because TermEnum objects handle multiple  
>> fields, but in KS, the field is fixed.
>
> Do you mean that the field contains the terms, which contain the  
> field name? This does seem redundant.

The KS implementation is not inefficient; it's just a little bloated.

Both the Lucene and KS file formats use string deltas to store the  
text of each term, so encoding "petard" followed by "petunia" looks  
something like this:

    my $prefix_length = 3;
    my $diff_length   = 4;
    $outstream->print(
        pack( 'wwa*', $prefix_length, $diff_length, "unia" ) );
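
For anyone who prefers to see that at the C level, here is a rough sketch of the same delta encoding.  It is not the actual KS code: shared_prefix_len(), write_ber() and encode_delta() are names made up for illustration, and the only assumption about the format is that the two integers are BER compressed integers as produced by Perl's pack 'w' (base-128 digits, most significant first, high bit set on every byte except the last).

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: length of the prefix shared by two C strings. */
static size_t
shared_prefix_len(const char *prev, const char *cur)
{
    size_t i = 0;
    while (prev[i] != '\0' && prev[i] == cur[i]) { i++; }
    return i;
}

/* Hypothetical helper: write `value` as a BER compressed integer, the
 * format of Perl's pack 'w'.  Returns the number of bytes written. */
static size_t
write_ber(unsigned char *out, unsigned long value)
{
    unsigned char tmp[sizeof(unsigned long) * 8 / 7 + 1];
    size_t n = 0;

    /* Collect base-128 digits, least significant first. */
    do {
        tmp[n++] = (unsigned char)(value & 0x7F);
        value >>= 7;
    } while (value != 0);

    /* Emit them most significant first, flagging all but the last. */
    for (size_t i = 0; i < n; i++) {
        unsigned char byte = tmp[n - 1 - i];
        if (i + 1 < n) { byte |= 0x80; }
        out[i] = byte;
    }
    return n;
}

/* Encode `cur` as a delta against `prev`: prefix length, suffix length,
 * then the suffix bytes.  `out` needs room for strlen(cur) + 20 bytes.
 * Returns the number of bytes written. */
static size_t
encode_delta(unsigned char *out, const char *prev, const char *cur)
{
    assert(out != NULL && prev != NULL && cur != NULL);

    size_t prefix_len = shared_prefix_len(prev, cur);
    size_t diff_len   = strlen(cur) - prefix_len;
    size_t pos        = 0;

    pos += write_ber(out + pos, (unsigned long)prefix_len);
    pos += write_ber(out + pos, (unsigned long)diff_len);
    memcpy(out + pos, cur + prefix_len, diff_len);
    return pos + diff_len;
}
```

Calling encode_delta(buf, "petard", "petunia") writes six bytes: 3, 4, and then "unia", which is the same thing the pack() call above produces.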

Lucene needs one extra compressed integer per entry to represent the  
field number -- not a big deal.  The real bottleneck in search occurs  
when processing postings, not when scanning through lexicon data.

>> Making such a change wouldn't be trivial, but it's probably  
>> worthwhile.
>
> That would certainly make things simpler. Of course, it’s up to you.

I'll give it a whirl.

I'm thinking that Lexicon's get_term() method will probably go away  
in favor of two new methods.

    my $term_text = $lexicon->get_value;
    my $value_obj = $lexicon->peek_value;

For now, only get_value will be public.  Internally, KS will use  
peek_value.

At the C level, Lex_Peek_Value() will return a pointer to the  
lexicon->value member variable.  In anticipation of extending support  
for non-text fields, we'll make that an Obj*.  For now, it will always  
be a ByteBuf.

   Obj*
   Lex_peek_value(Lexicon *self)
   {
      return self->value;
   }

Before the iterator starts and after it finishes, Lex_Peek_Value will  
return NULL.
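
To make the NULL-before-and-after behavior concrete, here is a toy iterator that follows the same contract.  It is only a sketch: ToyLexicon and the toylex_* functions are invented for this example and are not part of KS, and the values are plain C strings rather than ByteBufs.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a Lexicon that iterates over the values of
 * a single field. */
typedef struct ToyLexicon {
    const char **values;  /* sorted values for one field */
    size_t       count;   /* number of entries in `values` */
    size_t       tick;    /* index of the next entry to visit */
    const void  *value;   /* current value; NULL when not on an entry */
} ToyLexicon;

static void
toylex_init(ToyLexicon *self, const char **values, size_t count)
{
    assert(self != NULL);
    self->values = values;
    self->count  = count;
    self->tick   = 0;
    self->value  = NULL;  /* iteration has not started yet */
}

/* Advance to the next entry.  Returns 1 on success, 0 when exhausted. */
static int
toylex_next(ToyLexicon *self)
{
    if (self->tick >= self->count) {
        self->value = NULL;  /* iteration has finished */
        return 0;
    }
    self->value = self->values[self->tick++];
    return 1;
}

/* Return the current value without copying it, or NULL if the iterator
 * is not positioned on an entry. */
static const void*
toylex_peek_value(const ToyLexicon *self)
{
    return self->value;
}
```

A caller loops with while (toylex_next(&lex)) and peeks inside the loop body.  Any peek outside that loop, before the first next or after the last, gets NULL.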

get_value() will be XS-only.  It will return either undef, a Perl  
object representing lexicon->value, or, if lexicon->value is a  
ByteBuf, a plain Perl scalar string.

     SV*
     get_value(self)
         kino_Lexicon *self;
     CODE:
     {
         kino_Obj *value = Kino_Lex_Peek_Value(self);
         if (value == NULL) {
             RETVAL = newSV(0);
         }
         else if (KINO_OBJ_IS_A(value, KINO_BYTEBUF)) {
             RETVAL = bb_to_sv((kino_ByteBuf*)value);
         }
         else {
             RETVAL = Kino_Obj_To_Native(value);
         }
     }
     OUTPUT: RETVAL

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/


