[KinoSearch] Wildcards
Marvin Humphrey
marvin at rectangular.com
Sun Jan 27 20:20:00 PST 2008
On Jan 25, 2008, at 12:57 PM, Father Chrysostomos wrote:
>> 2. At some point, it would be nice to support non-text fields.
>
> Since binary data can be stored in a string, it is already
> supported, is it not?
No, that's not the case. In the devel branch, non-binary fields are
forced to UTF-8 very early on, in InvIndexer::add_doc.
>> Hmm. Thinking over the second point, perhaps it would be best if
>> Lexicons only stored field values rather than terms. In Lucene,
>> that wouldn't work because TermEnum objects handle multiple
>> fields, but in KS, the field is fixed.
>
> Do you mean that the field contains the terms, which contain the
> field name? This does seem redundant.
The KS implementation is not inefficient; it's just a little bloated.
Both the Lucene and KS file formats use string deltas, so encoding
"petard" followed by "petunia" looks something like this:
    my $prefix_length = 3;
    my $diff_length   = 4;
    $outstream->print( pack( 'wwa*', $prefix_length, $diff_length,
        "unia" ) );
Lucene needs one extra compressed integer per entry to represent the
field number -- not a big deal. The real bottleneck in search occurs
when processing postings, not when scanning through lexicon data.
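For anyone who wants the delta scheme spelled out, here is a rough C sketch of what the pack( 'wwa*', ... ) call above produces, plus the matching decode step. The function names (write_ber, read_ber, encode_delta, decode_delta) are made up for illustration and are not part of KS or Lucene. The integer encoding mirrors Perl's 'w' template (BER compressed integer), which is an assumption carried over from the Perl snippet and not a statement about either on-disk format. There is no bounds checking, so treat it as a sketch only.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Write n as a BER compressed integer, as Perl's pack 'w' does:
 * base-128 digits, most significant first, high bit set on every
 * byte except the last.  Returns the number of bytes written. */
size_t
write_ber(uint32_t n, unsigned char *out)
{
    unsigned char digits[5];
    size_t count = 0;
    do {
        digits[count++] = (unsigned char)(n & 0x7F);
        n >>= 7;
    } while (n > 0);
    /* digits[] is least significant first; emit it reversed. */
    for (size_t i = 0; i < count; i++) {
        unsigned char digit = digits[count - 1 - i];
        out[i] = (i + 1 < count) ? (unsigned char)(digit | 0x80) : digit;
    }
    return count;
}

/* Read one BER compressed integer.  Returns bytes consumed. */
size_t
read_ber(const unsigned char *in, uint32_t *n)
{
    size_t i = 0;
    uint32_t value = 0;
    do {
        value = (value << 7) | (uint32_t)(in[i] & 0x7F);
    } while (in[i++] & 0x80);
    *n = value;
    return i;
}

/* Encode `term` as a delta against `prev`: shared prefix length,
 * length of the differing tail, then the tail bytes themselves.
 * `out` needs room for term_len + 10 bytes.  Returns bytes written. */
size_t
encode_delta(const char *prev, size_t prev_len,
             const char *term, size_t term_len, unsigned char *out)
{
    size_t max = prev_len < term_len ? prev_len : term_len;
    size_t prefix_len = 0;
    while (prefix_len < max && prev[prefix_len] == term[prefix_len]) {
        prefix_len++;
    }
    size_t diff_len = term_len - prefix_len;
    size_t pos = 0;
    pos += write_ber((uint32_t)prefix_len, out + pos);
    pos += write_ber((uint32_t)diff_len, out + pos);
    memcpy(out + pos, term + prefix_len, diff_len);
    return pos + diff_len;
}

/* Decode one entry.  `term` must already hold the previous term; the
 * tail is overwritten in place.  Returns bytes consumed from `in`. */
size_t
decode_delta(char *term, size_t *term_len, const unsigned char *in)
{
    uint32_t prefix_len, diff_len;
    size_t pos = 0;
    pos += read_ber(in + pos, &prefix_len);
    pos += read_ber(in + pos, &diff_len);
    memcpy(term + prefix_len, in + pos, diff_len);
    *term_len = (size_t)prefix_len + diff_len;
    return pos + diff_len;
}
```

Encoding "petunia" against "petard" with this sketch yields six bytes: 3, 4, then "unia". A Lucene-style entry would carry one more compressed integer for the field number.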
>> Making such a change wouldn't be trivial, but it's probably
>> worthwhile.
>
> That would certainly make things simpler. Of course, it’s up to you.
I'll give it a whirl.
I'm thinking that Lexicon's get_term() method will probably go away
in favor of two new methods.
    my $term_text = $lexicon->get_value;
    my $value_obj = $lexicon->peek_value;
For now, only get_value will be public. Internally, KS will use
peek_value.
At the C level, Lex_Peek_Value() will return a pointer to the
lexicon->value member variable. In anticipation of extending support to
non-text fields, we'll make that an Obj*. For now, it will always be
a ByteBuf.
    Obj*
    Lex_peek_value(Lexicon *self)
    {
        return self->value;
    }
Before the iterator starts and after it finishes, Lex_Peek_Value will
return NULL.
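To illustrate the NULL-before-and-after behavior, here is a small standalone C sketch of an iterator with a peek method. DemoLexicon and its functions are hypothetical stand-ins, not the real KS Lexicon struct or API, and the terms are plain C strings where KS would hold a ByteBuf.

```c
#include <stddef.h>

/* Hypothetical stand-in for a lexicon iterating over a fixed,
 * sorted list of terms. */
typedef struct {
    const char **terms;   /* sorted term texts */
    size_t       size;    /* number of terms */
    size_t       tick;    /* index of the next term to load */
    const void  *value;   /* current value, or NULL when not on a term */
} DemoLexicon;

void
DemoLex_init(DemoLexicon *self, const char **terms, size_t size)
{
    self->terms = terms;
    self->size  = size;
    self->tick  = 0;
    self->value = NULL;   /* iterator has not started yet */
}

/* Advance to the next term.  Returns 1 on success, 0 when exhausted. */
int
DemoLex_next(DemoLexicon *self)
{
    if (self->tick < self->size) {
        self->value = self->terms[self->tick++];
        return 1;
    }
    self->value = NULL;   /* iterator has finished */
    return 0;
}

/* Return the current value without copying it; NULL before the first
 * call to DemoLex_next and after the iterator is exhausted. */
const void*
DemoLex_peek_value(DemoLexicon *self)
{
    return self->value;
}
```

Because the peek method hands back the member itself and not a copy, a caller has to treat the result as valid only until the next call that advances the iterator.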
get_value() will be XS-only. It will return either undef, a Perl
object representing lexicon->value, or, if lexicon->value is a
ByteBuf, a plain Perl scalar string.
    SV*
    get_value(self)
        kino_Lexicon *self;
    CODE:
    {
        kino_Obj *value = Kino_Lex_Peek_Value(self);
        if (value == NULL) {
            RETVAL = newSV(0);
        }
        else if (KINO_OBJ_IS_A(value, KINO_BYTEBUF)) {
            RETVAL = bb_to_sv(value);
        }
        else {
            RETVAL = Kino_Obj_To_Native(value);
        }
    }
    OUTPUT: RETVAL
Marvin Humphrey
Rectangular Research
http://www.rectangular.com/