[KinoSearch] FieldSpec/InvIndexSpec API

Marvin Humphrey marvin at rectangular.com
Mon Nov 20 16:17:23 PST 2006


So, checking out what stuff would look like in YAML as opposed to XML...

Here's the .cfsmeta compound file description...

   ---
   seg_name: _1
   sub_files:
     -
       name: _1.tii
       offset: 0
       length: 440
     -
       name: _1.tis
       offset: 440
       length: 2709


Looks good.  Parsing the name-value pairs is cake.  I still have to
wrap my head around how to keep track of the indentation level and
where each data structure begins and ends, though.
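
One possible way to handle the indentation bookkeeping is sketched below. It is written in Python for brevity and is not KinoSearch code. All function names (`tokenize`, `parse_block`, and so on) are hypothetical. It assumes space-only indentation, integer or string scalars, optional single quotes, and no colons inside keys. The idea is to record each line's indent once, then let a recursive parser consume lines for as long as they sit at the indent of the block it is building.

```python
import re

def tokenize(text):
    """Turn raw text into (indent, content) pairs, skipping blanks and '---'."""
    lines = []
    for raw in text.splitlines():
        content = raw.strip()
        if not content or content == '---':
            continue
        indent = len(raw) - len(raw.lstrip(' '))
        lines.append((indent, content))
    return lines

def unquote(tok):
    """Strip one pair of surrounding single quotes, if present."""
    if len(tok) >= 2 and tok[0] == "'" and tok[-1] == "'":
        return tok[1:-1]
    return tok

def scalar(tok):
    """Bare ASCII integers become int; everything else is a string."""
    if re.fullmatch(r'-?[0-9]+', tok):
        return int(tok)
    return unquote(tok)

def is_seq_item(content):
    return content == '-' or content.startswith('- ')

def parse_block(lines, pos, indent):
    """Parse the mapping or sequence whose entries all sit at `indent`."""
    if is_seq_item(lines[pos][1]):
        return parse_seq(lines, pos, indent)
    return parse_map(lines, pos, indent)

def parse_child(lines, pos, indent):
    """Parse a nested block if the next line is indented deeper."""
    if pos < len(lines) and lines[pos][0] > indent:
        return parse_block(lines, pos, lines[pos][0])
    return None, pos

def parse_seq(lines, pos, indent):
    items = []
    while (pos < len(lines) and lines[pos][0] == indent
           and is_seq_item(lines[pos][1])):
        rest = lines[pos][1][1:].strip()
        pos += 1
        if rest:                      # "- value" on one line
            items.append(scalar(rest))
        else:                         # bare "-": item is the deeper block
            value, pos = parse_child(lines, pos, indent)
            items.append(value)
    return items, pos

def parse_map(lines, pos, indent):
    result = {}
    while pos < len(lines) and lines[pos][0] == indent:
        key, _, rest = lines[pos][1].partition(':')
        key, rest = unquote(key.strip()), rest.strip()
        pos += 1
        if rest:                      # "key: value" on one line
            result[key] = scalar(rest)
        else:                         # "key:" followed by a deeper block
            result[key], pos = parse_child(lines, pos, indent)
    return result, pos

def parse(text):
    lines = tokenize(text)
    if not lines:
        return None
    value, pos = parse_block(lines, 0, lines[0][0])
    if pos != len(lines):
        raise ValueError("unexpected indentation at: %r" % (lines[pos][1],))
    return value
```

A block ends as soon as a line appears at a shallower indent, so the recursion unwinds without an explicit stack. Fed the .cfsmeta text above, `parse` would return a dict with a `seg_name` string and a `sub_files` list of two dicts.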

Since there's no requirement that everything be housed within a root  
node -- unlike XML -- I haven't included one.

The delqueue file...

   ---
   files:
     - _10.cfs
     - _10.cfsmeta
     - _11.cfs
     - _11.cfsmeta


The lock file...

   ---
   invindex: /path/to/invindex


The per-segment .delmeta file...

   ---
   seg_name: _2
   num_deletions: 5
   byte_size: 1291


Those are all straightforward.  Let's consider the big one,  
invindex.meta...

   ---
   analyzer: 'CustomAnalyzer'
   fields:
     'title':
       number: 0
       spec:
         name: 'KinoSearch::Index::DefaultFieldSpec'
         arguments:
           boost: 1
           indexed: 1
           analyzed: 1
           stored: 1
           compressed: 0
           vectorized: 1
     'body':
       number: 1
       spec:
         name: 'KinoSearch::Index::DefaultFieldSpec'
         arguments:
           boost: 2
           indexed: 1
           analyzed: 1
           stored: 1
           compressed: 0
           vectorized: 1
     'url':
       number: 2
       spec:
         name: 'KinoSearch::Index::DefaultFieldSpec'
         arguments:
           boost: 1
           indexed: 1
           analyzed: 0
           stored: 1
           compressed: 0
           vectorized: 0
     'date':
       number: 3
       spec:
         name: 'CustomFieldSpec'


I like it.  While I was writing that, I was thinking ahead about just
how the FieldSpec API should work -- I wasn't distracted by the
challenge of visually parsing the data, as I would have been with
XML.  It has the clarity that a straight-up name-value pair config
file would have, but it maps onto a multi-level data structure of
hashes and arrays, and it's upwards-compatible with a reasonably
popular spec.
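
To illustrate the mapping onto hashes and arrays in the other direction, here is a minimal Python sketch of a writer for the same subset. It is not KinoSearch code, and `emit` and `to_yaml` are hypothetical names. It assumes scalars need no quoting or escaping and that no hash or array is empty. The layout follows the samples above: two spaces per level, and a bare "-" line before each nested array element.

```python
def emit(value, indent=0):
    """Return the lines for `value`, each indented by `indent` spaces."""
    pad = ' ' * indent
    lines = []
    if isinstance(value, dict):
        for key, item in value.items():
            if isinstance(item, (dict, list)):
                lines.append(f"{pad}{key}:")          # nested block follows
                lines.extend(emit(item, indent + 2))
            else:
                lines.append(f"{pad}{key}: {item}")   # simple name-value pair
    elif isinstance(value, list):
        for item in value:
            if isinstance(item, (dict, list)):
                lines.append(f"{pad}-")               # bare dash, then block
                lines.extend(emit(item, indent + 2))
            else:
                lines.append(f"{pad}- {item}")
    else:
        lines.append(f"{pad}{value}")
    return lines

def to_yaml(value):
    """Serialize nested dicts/lists/scalars with a leading '---' marker."""
    return "\n".join(["---"] + emit(value)) + "\n"
```

For example, `to_yaml({"files": ["_10.cfs", "_10.cfsmeta"]})` would produce the delqueue layout shown earlier, minus the extra left margin used in this message.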

I don't see any need for multi-line strings, or even double quoting  
with C-style escapes, let alone any of YAML's more esoteric  
extensions (like references).  It's more complicated to write the  
parser than it would have been for XML, but it's still doable.

Thoughts?

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/




