[KinoSearch] Dynamic schemas - How?

Marvin Humphrey marvin at rectangular.com
Thu Mar 1 10:17:23 PST 2007


(Apologies for the previous empty message, courtesy of an errant  
mouse click).

Marc,

Thanks very much for speaking up.  It's possible to revise Schema now  
with no real penalty beyond the breakage 0.20 introduces.

If anyone else has any suggestions (: or gripes :) about KinoSearch's  
API, NOW is the time to make them known.

After your feedback about Schema, I've decided to give it a minor  
overhaul.

As currently implemented, init_fields() doesn't do much except  
generate a hash and perform some verification.  If we move the  
verification routines to the constructor, then we can just replace  
init_fields() with a required variable, %FIELDS.

    our %FIELDS = (
        title   => 'KinoSearch::Schema::FieldSpec',
        content => 'KinoSearch::Schema::FieldSpec',
        url     => 'UnAnalyzedFieldSpec',
    );
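
To make the idea concrete, here is a rough sketch of a constructor that picks up %FIELDS from the subclass and does the verification itself.  SketchSchema, its internal {fields} hash and the specific checks are made up for illustration; this is not the actual KinoSearch code.

```perl
use strict;
use warnings;

# Hypothetical stand-in for KinoSearch::Schema, for illustration only.
package SketchSchema;
use Carp qw( croak );

sub new {
    my $class = shift;

    # Look up the subclass's package variable %FIELDS by name.
    my $fields = do { no strict 'refs'; \%{"${class}::FIELDS"} };
    croak("$class must define a non-empty %FIELDS") unless %$fields;

    my $self = bless { fields => {} }, $class;

    # Verification formerly done by init_fields() happens here instead.
    for my $name ( sort keys %$fields ) {
        my $spec = $fields->{$name};
        croak("Invalid field name: '$name'") unless $name =~ /^\w+$/;
        croak("No FieldSpec class given for '$name'")
            unless defined $spec && length $spec;
        $self->{fields}{$name} = $spec;
    }
    return $self;
}

package MySchema;
our @ISA    = ('SketchSchema');
our %FIELDS = (
    title   => 'KinoSearch::Schema::FieldSpec',
    content => 'KinoSearch::Schema::FieldSpec',
    url     => 'UnAnalyzedFieldSpec',
);

package main;

my $schema = MySchema->new;
print join( ', ', sort keys %{ $schema->{fields} } ), "\n";
```

A subclass that forgets to define %FIELDS fails at construction time in this sketch, which is where the checks from init_fields() would end up.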

add_field() will work as described earlier -- it will be an instance  
method only.

    my $schema = MySchema->new;
    $schema->add_field( $_ => 'CustomSpec' ) for @dynamic_fields;

I think that's a better API.  add_field() and init_fields() were  
confusingly similar.  Now there will be no confusion as to what's  
class data and what's instance data.

I've also decided to try to make it possible to call add_field() at  
any time during indexing.

    my $schema     = MySchema->new;
    my $invindexer = KinoSearch::InvIndexer->new(
        invindex => $schema->open('/path/to/invindex'),
    );
    while ( my $doc = get_doc_hashref_from_somewhere() ) {
        $schema->add_field( $_ => 'CustomSpec' ) for keys %$doc;
        $invindexer->add_doc($doc);
    }

Adding the same field name over and over again with add_field() won't  
be an error, unless you try to switch up the FieldSpec subclass it's  
associated with -- once a field is associated with a given classname,  
it's forever, as far as that Schema object and that invindex are concerned.
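
The bookkeeping behind that rule could look something like the following.  Again, SketchSchema and its {fields} hash are hypothetical, and the error message is invented; only the behavior (a repeat add is a no-op, a class switch is fatal) comes from the description above.

```perl
use strict;
use warnings;

# Hypothetical stand-in for Schema; only add_field() bookkeeping is shown.
package SketchSchema;
use Carp qw( croak );

sub new { return bless { fields => {} }, shift }

sub add_field {
    my ( $self, $name, $spec_class ) = @_;
    my $existing = $self->{fields}{$name};
    if ( defined $existing ) {
        # Re-adding a field with the same class is a harmless no-op.
        return if $existing eq $spec_class;
        croak( "Field '$name' is already associated with $existing; "
                . "can't switch to $spec_class" );
    }
    $self->{fields}{$name} = $spec_class;
    return;
}

package main;

my $schema = SketchSchema->new;

# Adding the same field repeatedly is fine.
$schema->add_field( title => 'CustomSpec' ) for 1 .. 3;

# Switching the associated class is an error.
my $ok = eval { $schema->add_field( title => 'OtherSpec' ); 1 };
print $ok ? "switch allowed\n" : "switch rejected: $@";
```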

This is a substantial change in how KinoSearch thinks about the index  
structure, and we'll have to sacrifice some validation here and  
there.  But the nice thing is that it won't be necessary to add a  
DeepFieldSpec class -- people who need to fake one-to-many  
relationships will be able to hack that up on their own.

>> For instance, InvIndexer->delete_by_term verifies that the field  
>> in question 1) is known, and 2) is spec'd as indexed.  If it isn't  
>> known, you probably misspelled it; if it wasn't indexed, no docs  
>> will be found and no deletions will occur -- and that's something  
>> you probably want to know about.
> Of course it would be nice to know, but if you would use the  
> add_field method described at the end of the posting, this would be  
> solved too since then the index knows about its fields, right?  
> Well you have to open the index first, but you have to do this  
> anyway if you want to add or delete anything.

The thing is, you might not know whether or not you've added a given  
field to your schema.  I could add a has_field() instance method to  
Schema, but I doubt people will go to the trouble of using it. :)  So  
we just have to live with the mildly reduced default level of safety.
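
For anyone who does want the extra check, a has_field() method would be trivial.  This sketch assumes the same hypothetical SketchSchema with a {fields} hash; neither the class nor the method exists in KinoSearch as of this message.

```perl
use strict;
use warnings;

# Hypothetical stand-in for Schema with a single known field.
package SketchSchema;

sub new { return bless { fields => { title => 'CustomSpec' } }, shift }

# True if the field has been added to this Schema object.
sub has_field {
    my ( $self, $name ) = @_;
    return exists $self->{fields}{$name} ? 1 : 0;
}

package main;

my $schema = SketchSchema->new;
for my $field (qw( title titel )) {
    if ( $schema->has_field($field) ) {
        print "'$field' is known\n";
    }
    else {
        print "'$field' is unknown; probably misspelled\n";
    }
}
```

Calling code could run a check like this before delete_by_term to get back the "did you misspell it?" safety net.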

> Of course, if I have the possibility to just open the index and know  
> about the fields used, I could dump my code which builds the schema  
> on-the-fly which would be great because it would simplify the  
> searcher code.

I'll take care of it.  :)  New documentation for Schema->open...

=head2 open

     my $invindex = MySchema->open('/path/to/invindex');

Open an existing invindex for either reading or updating.  All fields
which have ever been defined for this invindex will be loaded/verified
via add_field().

=cut

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/




