[KinoSearch] Re: KinoSearch feature suggestions
Marvin Humphrey
marvin at rectangular.com
Wed Jan 23 06:32:26 PST 2008
On Jan 22, 2008, at 5:04 PM, Father Chrysostomos wrote:
>> I am in favor of wildcards being available via a separate
>> distribution, and I would very much like to hammer out an elegant
>> low-level API to support such a distro. A lot of the work I have
>> been doing lately is intended to facilitate such endeavors.
>
> Is there anything I can do to help on the Perl side?
If things work out as I hope, we'll be able to prototype this in pure
Perl. :)
Abstract methods for many KS base classes are now implemented by
calling back from C to the host language (in this case, Perl). These
capabilities aren't well-documented or API-stable yet, but
KinoSearch::Search::MockScorer serves as proof-of-concept.
We'll need at least three classes:
* KSx::Search::WildCard::WildCardQuery
* KSx::Search::WildCard::WildCardWeight
* KSx::Search::WildCard::WildCardScorer
A Lucene-like implementation would work something like this:
* Find all the terms that match the wildcard query string.
* Create one PostingList per term, and put them in a
PriorityQueue which
sorts by $posting_list->get_doc_num.
* Have $wild_card_scorer->next advance the priority queue.
There are some flaws in this design. It tends to blow up if you
match too many terms (Lucene throws an exception at 1024 by default
rather than run the risk of an out-of-memory error). Scoring is a
bit of a bugaboo, but for now we can punt and just have
WildCardScorer assess a fixed score of 0 for each doc.
However, with this design we can get something working reasonably
quickly, and then check out prior art from other IR projects. Peter,
does SWISH do wildcards?
>> How about if we outsource excerpting to subclasses of a new class,
>> KinoSearch::Highlight::Excerpter?
>
> I think I can have a patch for this in a couple of days.
Sweet. :)
> But the *offsets* of the page breaks need to be recorded. Counting
> is not sufficient. I still have to think more about how this should
> work—unless you have some ideas.
We can modify that function to record offsets in a Perl array. This
(untested) variant renders those offsets as counts of Unicode code
points:
use Inline => C <<'END_C';
AV*
count_breaks(SV *input_sv) {
STRLEN len;
char *ptr = SvPV_utf8(input_sv, len);
char *start = ptr;
char *end = SvEND(input_sv);
long offset = 0;
AV *offsets = newAV();
/* Accumulate an array of offsets for form-feed characters. */
while (ptr < end) {
if (*ptr++ == 12) {
SV *offset_sv = newSViv(offset);
av_push(offsets, offset_sv);
}
ptr += UTF8SKIP(ptr);
offset++;
}
return offsets;
}
END_C
Marvin Humphrey
Rectangular Research
http://www.rectangular.com/
_______________________________________________
KinoSearch mailing list
KinoSearch at rectangular.com
http://www.rectangular.com/mailman/listinfo/kinosearch
More information about the kinosearch
mailing list