[KinoSearch] index compression schemes

Marvin Humphrey marvin at rectangular.com
Mon Jun 30 20:22:47 PDT 2008


On Jun 30, 2008, at 12:52 PM, Nathan Kurz wrote:

> I was thinking about how one makes good use of multiple cores, and
> came across an interesting paper comparing inverted index compression
> schemes:
> http://www2008.org/papers/pdf/p387-zhangA.pdf

Good stuff!  I hadn't read this particular paper, but the index  
structures described at the front of the paper resemble KS closely and  
it's highly relevant.

> I'm not up on this, but hadn't heard at all about the PForDelta scheme
> they extoll.  You've probably seen it

I was not familiar with the specifics of the PForDelta scheme.   
However, I have a good understanding of the problem that it is  
designed to solve (compression of posting data).

The main thing to take away from this paper is the simple fact that  
*competing formats exist*.  Our goal must be to devise a robust and  
efficient plugin format, one that would allow us to use PForDelta  
compression, what the paper calls "VByte" (basically what KS uses now)  
or any of the other coding strategies.

The KinoSearch::Posting abstract class currently encapsulates our  
plugin format, but it has some limitations.  My original plan was to  
have a one-to-one relationship between Posting subclasses and index  
formats, but that turns out to be insufficient, and the PForDelta algo  
shows why.

PForDelta is only appropriate for batch processing:

     Some of the methods, in particular PForDelta and our implementation
     of Rice Coding, are not suitable for very short lists, as they
     operate on multiples of 32 or 128 numbers. To address this issue,  
we
     always used variable-byte coding for any inverted list with fewer
     than 100 postings, while for longer lists we padded the inverted
     lists with dummy entries as needed by the various compression
     methods.

However, Posting was not designed to maintain state well enough to  
batch process -- the writing method just took several arguments  
describing the last posting and the current posting in the loop.   
We're really going to need a dedicated PostingEncoder class to handle  
something like that.  And then probably we will need a dedicated  
PostingDecoder class as well for search-time.

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/






More information about the KinoSearch mailing list