[KinoSearch] Wildcards (was: Re: KinoSearch feature suggestions)
Marvin Humphrey
marvin at rectangular.com
Fri Jan 25 11:34:43 PST 2008
On Jan 25, 2008, at 8:38 AM, Father Chrysostomos wrote:
> I think I’m confused as to what the lexicon is for.
I've just committed some revised docs for Lexicon. Please let me
know if this is sufficiently clear:
KinoSearch::Index::Lexicon - Iterator for a field's Terms.
=head1 SYNOPSIS
my $lexicon = $index_reader->blank_lexicon('content');
while ( $lexicon->next ) {
print $lexicon->get_term->get_text . "\n";
}
=head1 DESCRIPTION
A Lexicon is an iterator which provides access to all the unique
Terms
for a given field in sorted order.
If an index consists of two documents with a 'content' field
holding
"three blind mice" and "three musketeers" respectively, then the
code
from the synopsis above would print this:
blind
mice
musketeers
three
> In your earlier example, you used
>
> my $lexicon = $reader->look_up_field($field);
I had just changed the name of this method to be "blank_lexicon".
my $lexicon = $reader->blank_lexicon($field);
However, that name choice seems to have been a bad call, judging by
your reaction:
> Why would we need to create a blank one?
Let's see if we can't come up with a less confusing solution.
Here's the rationale for the current names:
IndexReader provides factory methods which supply both Lexicons and
PostingLists.
my $seeked_lexicon = $reader->lexicon($term);
my $unseeked_lexicon = $reader->blank_lexicon($field);
my $seeked_posting_list = $reader->posting_list($term);
my $unseeked_posting_list = $reader->blank_posting_list($field);
blank_lexicon() and blank_posting_list() both return uninitialized
iterators. lexicon() and posting_list() both return iterators which
have been pre-located via seek($term).
With the name change, I'd hoped to eliminate any ambiguity as to
whether "look_up_field" would return a Lexicon or a PostingList, and
also psychologically reinforce the implication of parallel behavior
between "blank_lexicon" and "blank_posting_list".
One possible fix would be to supply only two methods, and have them
behave polymorphically based on whether a Term or a plain string
representing a field name was supplied:
my $seeked_lexicon = $reader->lexicon($term);
my $unseeked_lexicon = $reader->lexicon($field);
my $seeked_posting_list = $reader->posting_list($term);
my $unseeked_posting_list = $reader->posting_list($field);
That would be fine at the Perl level. However, I don't want to do
that at the C level, where these methods are actually implemented,
and it would be best if the C API and the Perl API resembled each
other as closely as possible.
So here's my latest thought:
my $seeked_lexicon = $reader->lexicon( 'content', 'foo' );
my $unseeked_lexicon = $reader->lexicon( 'content', undef );
my $seeked_posting_list = $reader->posting_list( 'content',
'foo' );
my $unseeked_posting_list = $reader->posting_list( 'content',
undef );
Does that make sense?
> Or is the idea to have a lexicon that covers multiple fields?
Lexicons explicitly do *not* cover multiple fields. In this, they
differ from the corresponding Lucene class, TermEnum, which always
iterates over all terms in all fields, sorting first by field name
then by term text. The file formats differ as well: for KS, each
field gets a dedicated file for simplicity's sake, while in Lucene,
values for multiple fields coexist within the same file.
There are two reasons behind this:
1. At some point, I intend to add support for specifying custom
sorting routines via FieldSpec, which would affect the order
in which the Terms in the Lexicon are sorted. Managing multiple
comparison functions within a single Lexicon object would be
complex and ugly.
2. At some point, it would be nice to support non-text fields.
Hmm. Thinking over the second point, perhaps it would be best if
Lexicons only stored field values rather than terms. In Lucene, that
wouldn't work because TermEnum objects handle multiple fields, but in
KS, the field is fixed.
Making such a change wouldn't be trivial, but it's probably worthwhile.
> Another thing: Since
‘pet*’ is essentially a type of simple
> regular expression, why not provide support for Regexp queries? It
> should be no less efficient if we look for a literal prefix
> (completely untested):
>
> # get the literal prefix of the regexp, if any.
> if($self->{re} =~
> /^
> (?: # prefix for qr//'s, without allowing /i :
> \(\? ([a-hj-z]*) (?:-[a-z]*)?:
> )?
> (\\[GA]|\^) # anchor
> ([^#\$()*+.?[\]\\^]+) # literal pat (no metachars or
> comments)
> /x
> ) {{
> my ($mod,$anchor,$prefix) = ($1,$2,$3);
> $anchor eq '^' and $mod =~ /m/ and last;
> $mod =~ /x/ and $prefix =~ s/\s+//g;
> $self->{prefix} = $prefix;
> }}
A RegexQuery class would be nice to have, but it would have some
significant limitations. If it used the existing KS index data
structures, it would not behave like a typical SQL regex or LIKE
query, matching the regex against the non-tokenized contents for each
field. If you did something like this...
my $regex_query = KSx::Search::RegexQuery->new(
field => 'content',
regex => qr/three blind/,
);
... and the 'content' field was tokenized, the regex wouldn't match
against any of the values in the Lexicon, since e.g. "blind" doesn't
match qr/three blind/.
In contrast, a trailing wildcard is relatively easy to implement,
because you can find all the terms that match with this routine:
my ($wild_text) = "pet*" =~ /^(.*?)\*/;
my $lexicon = $index_reader->posting_list( 'content',
$wild_text );
while (1) {
my $term = $lexicon->get_term;
last unless $term;
my $term_text = $term->get_text;
last unless index( $term_text, $wild_text ) == 0;
push @term_texts, $term_text;
last unless $lexicon->next;
}
A non-trailing wildcard is more expensive to implement, because it
requires iterating over the entire Lexicon.
> Then a wild card query could be a subclass that does the following
> to its input:
>
> $str = quotemeta $str;
> for($str) {
> s/\\\*/.*/g;
> s/(?:\.\*){2,}/.*/g;
> s/^/^/;
> s/\z/\\z/;
> }
Using regexes to modify regexes is probably not something I would
have thought to do, but that's all the better. My goal is to provide
the scaffolding. I'm focused on getting Lexicon and PostingList
right, so that you can use or abuse them as you wish. :)
Marvin Humphrey
Rectangular Research
http://www.rectangular.com/
_______________________________________________
KinoSearch mailing list
KinoSearch at rectangular.com
http://www.rectangular.com/mailman/listinfo/kinosearch
More information about the kinosearch
mailing list