[KinoSearch] FieldSpec/InvIndexSpec API

Marvin Humphrey marvin at rectangular.com
Mon Nov 20 13:07:55 PST 2006


On Nov 20, 2006, at 12:51 AM, Tony Bowden wrote:

> Peter Karman wrote:
>>> For now, all we need is a way to convey small amounts of data in  
>>> a tree structure.  I think if we limit ourselves to a strict  
>>> subset of XML, writing our own C parser will be simple enough.   
>>> Here's a starting set of constraints:
>>>
>>>   * No attributes.
>>>   * ASCII-only.
>>>   * No escapes.
>>>
>>> Basically, nothing except for paired tags indicating node name,  
>>> with a text value and optional child nodes.
>> yes, given the XML you've been describing, it makes sense to go  
>> with a very lightweight parser.
>> Glad to see you're going the XML route for now; writing the parser  
>> aside, I think it'll make life easier having a human-readable meta  
>> format.
> Perhaps I've missed this in earlier discussion, but if you're not  
> actually going to use XML, and are going to roll your own parser,  
> why use an XML-like format? On top of the confusion from users who  
> think that it is actually XML, wouldn't it be easier to create a  
> format that's both easier for a human to read and easier to parse?  
> (e.g. something more YAML-like)

I think the "no escapes" bullet point I made above may have been
misleading.  My intent was that the files would be legal XML.  A
future version could then use an XML parser of whatever strength to
parse the file, and it wouldn't choke before it could at least read a
version number and decide whether it could read the whole invindex.

As for it being easy to read, you're right that YAML is easier.
However, more people are familiar with XML, including me :), and my
guess is that the majority of hackers would acquire a sense of
mastery over XML-formatted data faster.  I think those two factors
offset each other.

As for ease of parsing, the equation is screwy because I'm talking  
about writing the parser in C -- so it's a PITA no matter what.  An  
internally consistent mini-XML parser with the above constraints is  
not hard, comparatively speaking.  There's only one kind of  
delimiter, which is handy.
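
To make the constraints concrete, here is a rough sketch of what such
a parser could look like.  Everything in it is an assumption made for
illustration: the Node struct, parse_element(), and node_free() are
hypothetical names, not KinoSearch code.  It accepts only paired tags
with no attributes, rejects any '&' so that escapes never need
decoding, and does not check that the input is ASCII.

```c
#include <assert.h>
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical node type for illustration; not KinoSearch's own. */
typedef struct Node {
    char *name;          /* tag name */
    char *text;          /* text found directly inside the tag */
    struct Node *child;  /* first child node, or NULL */
    struct Node *next;   /* next sibling, or NULL */
} Node;

/* Copy len bytes into a fresh NUL-terminated string. */
static char *copy_span(const char *start, size_t len) {
    char *s = malloc(len + 1);
    if (s) { memcpy(s, start, len); s[len] = '\0'; }
    return s;
}

/* Free a node, its children, and its following siblings. */
void node_free(Node *n) {
    while (n) {
        Node *next = n->next;
        node_free(n->child);
        free(n->name);
        free(n->text);
        free(n);
        n = next;
    }
}

/* Parse one <name>...</name> element starting at *pp.
 * On success, returns the node and moves *pp past the closing tag.
 * On malformed input or allocation failure, returns NULL. */
Node *parse_element(const char **pp) {
    assert(pp != NULL && *pp != NULL);
    const char *p = *pp;
    if (*p != '<') return NULL;
    const char *name = ++p;
    while (isalnum((unsigned char)*p) || *p == '_') p++;
    size_t name_len = (size_t)(p - name);
    if (name_len == 0 || *p != '>') return NULL;  /* no attributes */
    p++;

    Node *node = calloc(1, sizeof *node);
    if (!node) return NULL;
    node->name = copy_span(name, name_len);
    node->text = copy_span("", 0);
    Node **tail = &node->child;

    while (node->name && node->text && *p != '\0') {
        if (p[0] == '<' && p[1] == '/') {
            /* The closing tag must match the opening tag exactly. */
            if (strncmp(p + 2, name, name_len) == 0
                && p[2 + name_len] == '>') {
                *pp = p + name_len + 3;
                return node;
            }
            break;
        }
        if (*p == '<') {
            Node *kid = parse_element(&p);
            if (!kid) break;
            *tail = kid;            /* append to the child list */
            tail = &kid->next;
        } else {
            /* Text run up to the next tag; '&' would mean an escape. */
            size_t run = strcspn(p, "<");
            if (memchr(p, '&', run)) break;
            size_t old = strlen(node->text);
            char *grown = realloc(node->text, old + run + 1);
            if (!grown) break;
            memcpy(grown + old, p, run);
            grown[old + run] = '\0';
            node->text = grown;
            p += run;
        }
    }
    node_free(node);  /* unterminated, mismatched, or out of memory */
    return NULL;
}
```

Given a string such as "<invindex><version>1</version></invindex>",
a caller would point a const char * at it, pass its address to
parse_element(), and read the version from root->child->text.  Note
that whitespace between child tags ends up in the parent's text; a
real parser would probably want to trim it.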

However, the nice advantage of YAML is that it maps onto native data
structures cleanly.  KS now has its own lightweight hash,
variable-sized array, and string constructs, so the parser could read
directly into those and we wouldn't need an XMLNode class.  It's worth
further exploration.
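
As a small illustration of reading straight into a native structure,
the sketch below parses flat "key: value" lines, a tiny YAML-like
subset rather than real YAML, directly into a table of string pairs
with no intermediate node tree.  PairTable, read_pairs(), and
lookup() are hypothetical stand-ins invented for this example; they
are not the KS hash or string constructs.

```c
#include <assert.h>
#include <string.h>

#define MAX_PAIRS 16
#define MAX_LEN   64

/* Hypothetical stand-in for a hash: a fixed table of string pairs. */
typedef struct {
    char keys[MAX_PAIRS][MAX_LEN];
    char vals[MAX_PAIRS][MAX_LEN];
    int  count;
} PairTable;

/* Read "key: value" lines from src into table, skipping empty lines.
 * Returns 0 on success, -1 on a malformed or oversized line or when
 * the table is full. */
int read_pairs(const char *src, PairTable *table) {
    table->count = 0;
    while (*src) {
        size_t line_len = strcspn(src, "\n");
        if (line_len > 0) {
            const char *colon = memchr(src, ':', line_len);
            if (!colon || table->count == MAX_PAIRS) return -1;
            size_t key_len = (size_t)(colon - src);
            const char *end = src + line_len;
            const char *val = colon + 1;
            while (val < end && *val == ' ') val++;  /* skip padding */
            size_t val_len = (size_t)(end - val);
            if (key_len == 0 || key_len >= MAX_LEN || val_len >= MAX_LEN)
                return -1;
            memcpy(table->keys[table->count], src, key_len);
            table->keys[table->count][key_len] = '\0';
            memcpy(table->vals[table->count], val, val_len);
            table->vals[table->count][val_len] = '\0';
            table->count++;
        }
        src += line_len;
        if (*src == '\n') src++;
    }
    return 0;
}

/* Linear search by key; returns the value or NULL if absent. */
const char *lookup(const PairTable *table, const char *key) {
    for (int i = 0; i < table->count; i++)
        if (strcmp(table->keys[i], key) == 0) return table->vals[i];
    return NULL;
}
```

The point of the sketch is only that each line lands directly in the
container the rest of the code would use.  Nested mappings and
sequences, which real YAML supports, would need indentation tracking
on top of this.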

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/




