[KinoSearch] Serialized Schema

Marvin Humphrey marvin at rectangular.com
Sun Sep 30 21:40:22 PDT 2007



On Sun, Sep 30, 2007 at 03:56:21PM -0600, Nathan Kurz wrote:
> Simple tokenizers work for the things I want to do, but I'm not
> sure a regex is really that generally useful.   How many useful
> regular expressions are we talking about here?   

There's the KS default:

     qr/\w+(?:'\w+)*/

Then there's WhiteSpaceTokenizer:

     qr/\S+/
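
As a rough illustration of how those two patterns differ, here is a minimal sketch that applies each one to the same string. The tokenize() helper and the sample text are made up for this example; this is not how KS itself runs a tokenizer.

```perl
use strict;
use warnings;

# Hypothetical helper: return every match of $token_re in $text, in order.
sub tokenize {
    my ( $token_re, $text ) = @_;
    my @tokens;
    # The outer parens capture the whole match, whatever groups
    # $token_re may contain internally.
    while ( $text =~ /($token_re)/g ) {
        push @tokens, $1;
    }
    return @tokens;
}

my $text = "Don't panic, it's 4.5 o'clock!";

# KS default: word characters, with embedded apostrophes kept.
my @default = tokenize( qr/\w+(?:'\w+)*/, $text );

# WhiteSpaceTokenizer: anything between runs of whitespace.
my @whitespace = tokenize( qr/\S+/, $text );

print join( '|', @default ),    "\n";   # Don't|panic|it's|4|5|o'clock
print join( '|', @whitespace ), "\n";   # Don't|panic,|it's|4.5|o'clock!
```

The default pattern drops punctuation and splits "4.5" in two, while the whitespace pattern keeps punctuation attached to the neighboring word.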

Then there's the Lucene StandardTokenizer, which is implemented using javacc.
Plucene emulates it with regexes:

    # Don't blame me, blame the Plucene people!
    my $alpha      = qr/\p{IsAlpha}+/;
    my $apostrophe = qr/$alpha('$alpha)+/;
    my $acronym    = qr/$alpha\.($alpha\.)+/;
    my $company    = qr/$alpha(&|\@)$alpha/;
    my $hostname   = qr/\w+(\.\w+)+/;
    my $email      = qr/\w+\@$hostname/;
    my $p          = qr/[_\/.,-]/;
    my $hasdigit   = qr/\w*\d\w*/;
    my $num        = qr/\w+$p$hasdigit|$hasdigit$p\w+
                       |\w+($p$hasdigit$p\w+)+
                       |$hasdigit($p\w+$p$hasdigit)+
                       |\w+$p$hasdigit($p\w+$p$hasdigit)+
                       |$hasdigit$p\w+($p$hasdigit$p\w+)+/x;

    =head2 token_re

    The regular expression for tokenising.

    =cut

    sub token_re {
        qr/
            $apostrophe | $acronym | $company | $hostname | $email | $num
            | \w+
        /x;
    }

(: That first comment is in the source. :)

For simplicity, KS has stayed away from offering that as a stock item, but you
can see how it would be useful.  

Also, it's not uncommon to see messages to the Lucene users' list from someone
wanting to know how to tweak StandardTokenizer for a specific problem domain.
Variants on StandardTokenizer are possible with a regex-based tokenizer -- but
not with a named Tokenizer subclass.
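
To make the "tweak for a problem domain" point concrete, here is a small sketch. It adjusts the KS default pattern rather than StandardTokenizer, and the domain (hyphenated part numbers) and sample text are invented for the example.

```perl
use strict;
use warnings;

my $text = "Order part AB-1234 for O'Brien";

# The KS default pattern.
my $default_re = qr/\w+(?:'\w+)*/;

# Hypothetical domain variant: also allow a hyphen inside a token,
# so part numbers like "AB-1234" survive as one token.
my $part_number_re = qr/\w+(?:['-]\w+)*/;

# Neither pattern has capturing groups, so a list-context m//g
# returns the full text of each match.
my @default_tokens = $text =~ /$default_re/g;
my @variant_tokens = $text =~ /$part_number_re/g;

print join( '|', @default_tokens ), "\n";   # Order|part|AB|1234|for|O'Brien
print join( '|', @variant_tokens ), "\n";   # Order|part|AB-1234|for|O'Brien
```

The only thing that changed is the pattern string, which is exactly the kind of change a serialized token_re can carry and a class name cannot.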

> Also, tasks like tokenizing Asian languages seem like they would be hard
> with just a regex.

That's right.  But tokenizing Asian languages, particularly Japanese, is
frightfully difficult and complex -- so core KS isn't really the right place
for such a tokenizer, and it shouldn't be part of the file format.

> I'm thinking about it mostly on the search end, rather than the
> indexing end, with a flex-based tokenizer reading a file produced by
> another system.  Instead of being a different blessed Analyzer, it
> would just be another implementation capable of handling the same
> named tokenization scheme.

Yes, I can see how that would be handy. 

Nothing would stop you, though, from implementing such a named tokenizer.  You
just have to make sure that both implementations know how to handle it:

   analyzer: "My::Custom::Tokenizer"

That, as opposed to this:

   analyzer: 
      tokenizer:
         token_re: "\S+"

All I'm saying is that to fully support the invindex file format, you have to
support a small set of Analyzers.  (Not coincidentally, they're the ones that
are in KS right now.)
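
Here is a sketch of how a reader of the schema might handle both forms. It assumes the "analyzer" entry has already been parsed into a Perl scalar or hash; tokenizer_for(), the %named_tokenizers registry, and the comma-splitting stand-in for My::Custom::Tokenizer are all hypothetical and not part of KS.

```perl
use strict;
use warnings;

# Hypothetical registry of named tokenizers this implementation supports.
# The comma splitter is only a stand-in for a real custom tokenizer.
my %named_tokenizers = (
    'My::Custom::Tokenizer' => sub {
        my ($text) = @_;
        return [ split /,/, $text ];
    },
);

# Turn an already-parsed "analyzer" entry into a tokenizing coderef.
sub tokenizer_for {
    my ($analyzer) = @_;

    # Form 1: a plain string naming a tokenizer that every
    # implementation reading the file has to know about.
    if ( !ref $analyzer ) {
        my $code = $named_tokenizers{$analyzer}
            or die "Unknown analyzer '$analyzer'";
        return $code;
    }

    # Form 2: a nested structure that carries the pattern itself.
    my $pattern = $analyzer->{tokenizer}{token_re};
    die "No token_re in analyzer spec" unless defined $pattern;
    my $token_re = qr/$pattern/;
    return sub {
        my ($text) = @_;
        my @tokens;
        push @tokens, $1 while $text =~ /($token_re)/g;
        return \@tokens;
    };
}

my $by_name  = tokenizer_for('My::Custom::Tokenizer');
my $by_regex = tokenizer_for( { tokenizer => { token_re => '\S+' } } );

my $named_tokens = $by_name->('a,b c');     # ['a', 'b c']
my $regex_tokens = $by_regex->('a,b c');    # ['a,b', 'c']
```

With the second form, any implementation with a compatible regex engine can rebuild the tokenizer from the file alone. With the first form, it has to ship matching code for that name or fail, as the die above does.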

> > I'm tempted to write a formal spec called "ASHL" -- Array Scalar Hash
> > Language.
> 
> I don't think this is a good use of your time, unless this spec is
> explicitly written to define a subset of JSON or YAML.  

Agreed.  I don't really want that task.  I want the YAML people to define YAML
Level 1 so KS can use it.

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/


