[KinoSearch] index compression schemes
Marvin Humphrey
marvin at rectangular.com
Mon Jun 30 20:22:47 PDT 2008
On Jun 30, 2008, at 12:52 PM, Nathan Kurz wrote:
> I was thinking about how one makes good use of multiple cores, and
> came across an interesting paper comparing inverted index compression
> schemes:
> http://www2008.org/papers/pdf/p387-zhangA.pdf
Good stuff! I hadn't read this particular paper, but the index
structures described at the front of the paper resemble KS closely and
it's highly relevant.
> I'm not up on this, but hadn't heard at all about the PForDelta scheme
> they extoll. You've probably seen it
I was not familiar with the specifics of the PForDelta scheme.
However, I have a good understanding of the problem that it is
designed to solve (compression of posting data).
The main thing to take away from this paper is the simple fact that
*competing formats exist*. Our goal must be to devise a robust and
efficient plugin format, one that would allow us to use PForDelta
compression, what the paper calls "VByte" (basically what KS uses now)
or any of the other coding strategies.
The KinoSearch::Posting abstract class currently encapsulates our
plugin format, but it has some limitations. My original plan was to
have a one-to-one relationship between Posting subclasses and index
formats, but that turns out to be insufficient, and the PForDelta algo
shows why.
PForDelta is only appropriate for batch processing:
Some of the methods, in particular PForDelta and our implementation
of Rice Coding, are not suitable for very short lists, as they
operate on multiples of 32 or 128 numbers. To address this issue,
we
always used variable-byte coding for any inverted list with fewer
than 100 postings, while for longer lists we padded the inverted
lists with dummy entries as needed by the various compression
methods.
However, Posting was not designed to maintain state well enough to
batch process -- the writing method just took several arguments
describing the last posting and the current posting in the loop.
We're really going to need a dedicated PostingEncoder class to handle
something like that. And then probably we will need a dedicated
PostingDecoder class as well for search-time.
Marvin Humphrey
Rectangular Research
http://www.rectangular.com/
More information about the KinoSearch
mailing list