[KinoSearch] Re: KinoSearch feature suggestions

Marvin Humphrey marvin at rectangular.com
Wed Jan 23 06:32:26 PST 2008




On Jan 22, 2008, at 5:04 PM, Father Chrysostomos wrote:

>> I am in favor of wildcards being available via a separate  
>> distribution, and I would very much like to hammer out an elegant  
>> low-level API to support such a distro.  A lot of the work I have  
>> been doing lately is intended to facilitate such endeavors.
>
> Is there anything I can do to help on the Perl side?

If things work out as I hope, we'll be able to prototype this in pure  
Perl. :)

Abstract methods for many KS base classes are now implemented by  
calling back from C to the host language (in this case, Perl).  These  
capabilities aren't well-documented or API-stable yet, but  
KinoSearch::Search::MockScorer serves as proof-of-concept.

We'll need at least three classes:

     * KSx::Search::WildCard::WildCardQuery
     * KSx::Search::WildCard::WildCardWeight
     * KSx::Search::WildCard::WildCardScorer

A Lucene-like implementation would work something like this:

     * Find all the terms that match the wildcard query string.
     * Create one PostingList per term, and put them in a  
PriorityQueue which
       sorts by $posting_list->get_doc_num.
     * Have $wild_card_scorer->next advance the priority queue.

There are some flaws in this design.  It tends to blow up if you  
match too many terms (Lucene throws an exception at 1024 by default  
rather than run the risk of an out-of-memory error).  Scoring is a  
bit of a bugaboo, but for now we can punt and just have  
WildCardScorer assess a fixed score of 0 for each doc.

However, with this design we can get something working reasonably  
quickly, and then check out prior art from other IR projects.  Peter,  
does SWISH do wildcards?

>> How about if we outsource excerpting to subclasses of a new class,  
>> KinoSearch::Highlight::Excerpter?
>
> I think I can have a patch for this in a couple of days.

Sweet.  :)

> But the *offsets* of the page breaks need to be recorded. Counting  
> is not sufficient. I still have to think more about how this should  
> work—unless you have some ideas.

We can modify that function to record offsets in a Perl array.  This  
(untested) variant renders those offsets as counts of Unicode code  
points:

     use Inline => C <<'END_C';

     AV*
     count_breaks(SV *input_sv) {
         STRLEN len;
         char *ptr     = SvPV_utf8(input_sv, len);
         char *start   = ptr;
         char *end     = SvEND(input_sv);
         long  offset  = 0;
         AV   *offsets = newAV();

         /* Accumulate an array of offsets for form-feed characters. */
         while (ptr < end) {
             if (*ptr++ == 12) {
                 SV *offset_sv = newSViv(offset);
                 av_push(offsets, offset_sv);
             }
             ptr += UTF8SKIP(ptr);
             offset++;
         }

         return offsets;
     }

     END_C

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/



_______________________________________________
KinoSearch mailing list
KinoSearch at rectangular.com
http://www.rectangular.com/mailman/listinfo/kinosearch




More information about the kinosearch mailing list