[KinoSearch] Serialized Schema

Nathan Kurz nate at verse.com
Sun Sep 30 14:56:21 PDT 2007



On 9/29/07, Marvin Humphrey <marvin at rectangular.com> wrote:
> A single
> regex-based Tokenizer like the one we have now offers the greatest
> combination of flexibility, power, and simplicity of implementation.

Simple tokenizers work for the things I want to do, but I'm not
sure a regex is really that generally useful.  How many useful
regular expressions are we talking about here?  Also, tasks like
tokenizing Asian languages seem like they would be hard with just a
regex.  Someone wrote to the list a while ago asking about doing
that.
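
To illustrate the concern, here is a minimal sketch in Python.  It
uses Python's re module as a stand-in for the Perl regex engine, and
the sample strings and the tokenize() helper are made up for the
example.  A word-character pattern splits English on spaces, but text
written without spaces between words comes back as one long token:

```python
import re

# A word-character pattern, similar in spirit to a simple token regex.
TOKEN_RE = re.compile(r"\w+")

def tokenize(text):
    """Return every non-overlapping match of TOKEN_RE in text."""
    return TOKEN_RE.findall(text)

english = "full text search"
chinese = "全文搜索引擎"  # illustrative sample: no spaces between words

english_tokens = tokenize(english)  # one token per word
chinese_tokens = tokenize(chinese)  # a single token: the whole unsegmented run

print(english_tokens)
print(chinese_tokens)
```

Getting word-level tokens out of the second string needs segmentation
knowledge that a single pattern like this does not carry.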

> By the way, I suspect it was only a brain-hiccup on your part, but
> specifying a token_re of "\S+" is not the same as a split -- it's
> actually the inverse.

Indeed.  I partially reworded, didn't reread, and produced nonsense.
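
For anyone following along, the inversion can be sketched in Python
(an assumption on my part: Python's re module stands in for Perl
here, and the sample text is made up).  Matching "\S+" keeps the
tokens, which is what splitting on "\s+" gives you; splitting on
"\S+" throws the tokens away and keeps the whitespace:

```python
import re

text = "serialized  schema\tfor KinoSearch"

# Matching runs of non-whitespace keeps the tokens themselves.
matched = re.findall(r"\S+", text)

# Splitting on runs of whitespace discards the separators,
# leaving the same tokens (the text has no leading or trailing space).
split_on_space = re.split(r"\s+", text)

# Splitting on \S+ discards the tokens and keeps only the whitespace,
# plus empty strings before the first and after the last token.
split_on_nonspace = re.split(r"\S+", text)

print(matched)
print(split_on_space)
print(split_on_nonspace)
```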

> I thought about this for a while.  Perl-compatible regular expression
> syntax is very widespread.

Agreed in retrospect.  If you are specifying a regex, this is a fine
way to do it.  More generally, I'm fine with the path you suggest; I'm
just not sure the generality of the regex approach actually produces
much gain.  I have no objection to it in principle.

> Yes.  A flex-based tokenizer for a C implementation would be cool, it
> just wouldn't be part of the official list of blessed Analyzers.

I'm thinking about it mostly on the search end, rather than the
indexing end, with a flex-based tokenizer reading a file produced by
another system.  Instead of being a different blessed Analyzer, it
would just be another implementation capable of handling the same
named tokenization scheme.
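
A rough sketch of what I mean, in Python rather than flex and C.  The
scheme name "whitespace", the two registries, and the schema layout
are all hypothetical, not anything KinoSearch defines.  The point is
only that the serialized schema names a scheme, and each side looks
up its own implementation of it:

```python
import re

def regex_whitespace(text):
    """Regex-based implementation: tokens are runs of non-whitespace."""
    return re.findall(r"\S+", text)

def scanner_whitespace(text):
    """Hand-written scanner implementing the same scheme without a regex."""
    tokens, current = [], []
    for ch in text:
        if ch.isspace():
            if current:  # a whitespace character ends the token in progress
                tokens.append("".join(current))
                current = []
        else:
            current.append(ch)
    if current:  # flush the final token
        tokens.append("".join(current))
    return tokens

# Hypothetical registries: each system maps scheme names to its own code.
INDEXER_TOKENIZERS = {"whitespace": regex_whitespace}
SEARCHER_TOKENIZERS = {"whitespace": scanner_whitespace}

# Hypothetical serialized schema entry naming the scheme.
schema = {"tokenizer": "whitespace"}

query = "serialized  schema\tfile"
indexed_form = INDEXER_TOKENIZERS[schema["tokenizer"]](query)
searched_form = SEARCHER_TOKENIZERS[schema["tokenizer"]](query)
print(indexed_form == searched_form)
```

As long as both implementations agree on every input, the file
produced by one system stays usable by the other.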

> I'm tempted to write a formal spec called "ASHL" -- Array Scalar Hash
> Language.

I don't think this is a good use of your time, unless this spec is
explicitly written to define a subset of JSON or YAML.  Although you
will be happy writing your own ASHL interpreter, most others who would
be using your format for some other purpose would likely prefer to use
an existing interpreter.  If the point comes when some other
implementation using your file format gains sufficient popularity that
you need to support the full spec, count it as a success!
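
To make the subset idea concrete, here is a sketch in Python that
reads a file with a stock JSON parser and then rejects anything
outside arrays, hashes, and scalars.  Which scalars count is my
assumption (strings and numbers, no booleans or null), and
is_ashl_like() and load_restricted() are names invented for the
example, since no ASHL spec exists yet:

```python
import json

def is_ashl_like(value):
    """True if value uses only arrays, string-keyed hashes, and string or number scalars."""
    # Assumption: booleans and null fall outside the subset.
    # bool is checked first because it is a subclass of int in Python.
    if isinstance(value, bool) or value is None:
        return False
    if isinstance(value, (str, int, float)):
        return True
    if isinstance(value, list):
        return all(is_ashl_like(item) for item in value)
    if isinstance(value, dict):
        return all(isinstance(k, str) and is_ashl_like(v) for k, v in value.items())
    return False

def load_restricted(text):
    """Parse text with the standard JSON parser, then enforce the assumed subset."""
    data = json.loads(text)
    if not is_ashl_like(data):
        raise ValueError("document uses JSON features outside the assumed subset")
    return data

doc = load_restricted('{"fields": ["title", "body"], "version": 1}')
print(doc["fields"])
```

Anyone else can read the same file with whatever JSON library they
already have, and only the writer has to care about the restrictions.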

Nathan Kurz
nate at verse.com
