[KinoSearch] getting back to mmap
Marvin Humphrey
marvin at rectangular.com
Thu May 1 22:43:08 PDT 2008
On Apr 29, 2008, at 12:57 PM, Nathan Kurz wrote:
>>> We've been off the mmap theme for a while, but I think it's still
>>> very relevant to KinoSearch.
I agree that the potential is there. Some restrictions intentionally
built into KS should make things easier:
1) All files used at search-time are guaranteed to be read-only.
2) OutStreams cannot seek; files are always written linearly from
top to tail.
> To my recollection, portability to Windows is the only real problem
> with this approach.
I've been doing a little research and I think we'll be able to emulate
what we need to. Even if we can't, the current implementation will
continue to work just fine.
> On Linux (and I presume OS X) mmap already
> underlies the existing file system. All one is doing is stripping out
> some unnecessary copies and duplication of effort between KinoSearch
> and the system.
If you didn't already know, you may be amused to hear that KS still
uses standard C buffered IO (FILE*, fread, fwrite) because it's the
lowest common denominator for portability -- so there are presumably
multiple copies going on. :) However, at least
KinoSearch::Store::FSFileDes uses setvbuf() to turn off buffering for
output. Strangely, I wasn't able to turn off input buffering using
setvbuf() because doing so had a significant negative impact on the
indexing performance benchmark across multiple operating systems.
FWIW, a version of FSFileDes which used unbuffered IO (open(), read(),
and write()) only purchased an improvement of a couple percent.
Probably we'll finish an mmap version first and that patch will never
be applied.
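For anyone following along, here is a minimal sketch of what turning off stdio buffering on an output stream looks like. The helper name open_unbuffered_out is made up for illustration; this is not the actual FSFileDes code.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper, not the actual FSFileDes code: open a file for
 * writing and turn off stdio's own buffering, so that each fwrite() is
 * handed straight to the operating system. */
FILE *
open_unbuffered_out(const char *path)
{
    assert(path != NULL);
    FILE *fh = fopen(path, "wb");
    if (fh == NULL) {
        return NULL;
    }
    /* setvbuf() has to be called before any other operation on the
     * stream. With _IONBF the buffer and size arguments are ignored. */
    if (setvbuf(fh, NULL, _IONBF, 0) != 0) {
        fclose(fh);
        return NULL;
    }
    return fh;
}
```

The kernel still caches the written pages, so this only removes the copy through the stdio buffer in user space.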
I think the real gains are likely to come from tweaking how we do
caching at search time.
> The goal would be to have C structures with
> elements pointing directly to the system buffers, and to let the
> system handle all the paging and buffering issues.
I suspect we will still want to encapsulate IO functions within
InStream and OutStream... We'll still have code that looks like this:
    u32_t doc_num = InStream_Read_C32(instream);
However, InStream will "refill" its "buffer" using mmap/munmap rather
than by reading 1k worth of data at a shot from a file via fread() or
read(). In other words, it will be InStream's buffer that is the C
structure element pointing directly to the system buffer as you
describe above.
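To make that concrete, here is a rough POSIX-only sketch of a reader whose "buffer" is the mapped file itself. Everything in it is an assumption for illustration: the MMapIn names are made up, it maps the whole file at once rather than refilling a window with mmap/munmap, and it decodes a little-endian base-128 integer, which may not match the encoding InStream_Read_C32 really uses.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdint.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical stand-in for InStream: the buffer is the mapping itself. */
typedef struct MMapIn {
    const uint8_t *buf;    /* start of the mapped region */
    const uint8_t *cur;    /* next byte to read */
    const uint8_t *limit;  /* one past the last mapped byte */
} MMapIn;

/* Map an entire file read-only.  Returns NULL on failure or empty file. */
MMapIn *
MMapIn_open(const char *path)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0) {
        return NULL;
    }
    struct stat st;
    if (fstat(fd, &st) != 0 || st.st_size <= 0) {
        close(fd);
        return NULL;
    }
    size_t len = (size_t)st.st_size;
    void *map = mmap(NULL, len, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);  /* the mapping remains valid after the descriptor closes */
    if (map == MAP_FAILED) {
        return NULL;
    }
    MMapIn *self = malloc(sizeof(MMapIn));
    if (self == NULL) {
        munmap(map, len);
        return NULL;
    }
    self->buf   = map;
    self->cur   = self->buf;
    self->limit = self->buf + len;
    return self;
}

/* Decode one compressed integer (assumed format: 7 payload bits per byte,
 * low-order group first, high bit set on every byte except the last).
 * The asserts stand in for real error handling. */
uint32_t
MMapIn_read_c32(MMapIn *self)
{
    uint32_t value = 0;
    int shift = 0;
    for (;;) {
        assert(self->cur < self->limit && shift <= 28);
        uint8_t byte = *self->cur++;
        value |= (uint32_t)(byte & 0x7F) << shift;
        if ((byte & 0x80) == 0) {
            break;
        }
        shift += 7;
    }
    return value;
}

void
MMapIn_close(MMapIn *self)
{
    munmap((void *)self->buf, (size_t)(self->limit - self->buf));
    free(self);
}
```

Calling code keeps the same shape as before: it asks the stream for the next integer and never sees where the bytes came from.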
> One wouldn't need to change over to using all mmap'ed IO, rather just
> design the file structures and internal APIs so it is possible to use
> them in a more efficient fashion.
If I recall correctly, you were talking about having higher level
objects such as Posting hold pointers to the system buffers. The
thing is, the on-disk files use a lot of compression, so you can't
read them as e.g. arrays of u32_t.
But what if we tried? Say you were implementing the simplest kind of
posting list, which is just a list of document numbers. You could
theoretically write uncompressed 32-bit document numbers to disk, mmap
the file to a "doc_nums" array member within "MMapPosting" and then
implement it along these lines:
    u32_t
    MMapPosting_get_doc_num(MMapPosting *self)
    {
        return self->doc_nums[self->tick];
    }
However, if you did that, it would increase the file size and i/o
requirements by 2-4x. I don't think that's likely to yield a net gain.
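A back-of-the-envelope way to check that kind of figure is to count bytes for both layouts. The sketch below assumes the compressed form stores the gaps between sorted document numbers as base-128 variable-length integers; that is a common scheme but only an assumption here, and the function names are made up.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Bytes needed to store one value as a base-128 varint
 * (7 payload bits per byte). */
size_t
varint_size(uint32_t value)
{
    size_t size = 1;
    while (value >= 0x80) {
        value >>= 7;
        size++;
    }
    return size;
}

/* Bytes needed for a sorted doc number list stored as varint gaps
 * (assumed compression scheme, for illustration only). */
size_t
delta_varint_size(const uint32_t *doc_nums, size_t count)
{
    size_t total = 0;
    uint32_t prev = 0;
    for (size_t i = 0; i < count; i++) {
        assert(doc_nums[i] >= prev);  /* list must be sorted ascending */
        total += varint_size(doc_nums[i] - prev);
        prev = doc_nums[i];
    }
    return total;
}

/* Bytes needed for the same list as raw 32-bit integers. */
size_t
raw_u32_size(size_t count)
{
    return count * sizeof(uint32_t);
}
```

Under that assumption, a gap below 128 costs 1 byte instead of 4, and a gap between 128 and 16383 costs 2 bytes instead of 4, so the raw layout comes out 4x or 2x larger per entry.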
>> You may be interested in an ongoing dialog between Mike McCandless and
>> myself on java-dev at lucene.apache.org about PostingList and the
>> postings file format. There's some stuff in there about phrase scorers,
>> too. In addition to many other contributions to Lucene such as the
>> lockless-commits file format innovation, Mike's applied a bunch of
>> concepts from KS.
>>
>> http://www.nabble.com/Pooling-of-posting-objects-in-DocumentsWriter-tt16565743.html#a16596031
>
> Thanks! I only read through it quickly, but there are a lot of good
> ideas there (most of which flew over my head).
>
> Given my comments above about working with the file system, a few of
> the parts about bulk reads, buffered writes, and disk seeks made me
> cringe a little.
Well, did you at least notice that we were designing the file format
with SSDs in mind when you were scoring our discussion for buzzword
compliance? ;)
> As the architecture article mentioned, 'flushing'
> to 'disk' when your 'memory' is full might not work as well as one
> hopes.
Flushing to disk was discussed in the context of the external sorter,
used during indexing. For the external sorter, "flush" doesn't mean
just "write buffer to disk", it means "sort objects in cache, then
write their data to disk." Since it's not a simple write op, we don't
have the option of handing things off to the kernel.
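Here is a toy sketch of that distinction, with made-up names (SortCache and friends) and plain integers standing in for the sorter's real objects. The point is only that flush has to sort the cache before it writes a run, which is work the kernel cannot do on our behalf.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

#define CACHE_CAP 4  /* tiny on purpose, so a flush is easy to trigger */

/* Hypothetical in-memory cache for an external sorter. */
typedef struct SortCache {
    uint32_t items[CACHE_CAP];
    size_t   count;         /* items currently held in the cache */
    FILE    *run_file;      /* where sorted runs are appended */
    size_t   runs_written;  /* number of runs flushed so far */
} SortCache;

static int
compare_u32(const void *a, const void *b)
{
    uint32_t x = *(const uint32_t *)a;
    uint32_t y = *(const uint32_t *)b;
    return (x > y) - (x < y);
}

/* "Flush" = sort what is in the cache, then write it out as one run.
 * Returns 0 on success, -1 on a short write. */
int
SortCache_flush(SortCache *self)
{
    assert(self != NULL && self->run_file != NULL);
    if (self->count == 0) {
        return 0;
    }
    qsort(self->items, self->count, sizeof(uint32_t), compare_u32);
    size_t written = fwrite(self->items, sizeof(uint32_t), self->count,
                            self->run_file);
    if (written != self->count) {
        return -1;
    }
    self->count = 0;
    self->runs_written++;
    return 0;
}

/* Add one item, flushing first if the cache is full. */
int
SortCache_feed(SortCache *self, uint32_t item)
{
    if (self->count == CACHE_CAP && SortCache_flush(self) != 0) {
        return -1;
    }
    self->items[self->count++] = item;
    return 0;
}
```

Each run on disk is sorted internally but the runs are not sorted relative to each other; a complete external sorter would merge them in a later pass.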
Also, it would be unusual for an indexing application to behave like
squid. Indexers tend to be either active and dominating the computer,
or not running. Swapping out of pages because the indexing app has
been idle for a spell, well... that's just not a common enough use
case that we need to worry about optimizing for it, IMO.
> I'd guess that one could do something simpler but just as
> efficient by just calling 'write' each time, and letting the system
> decide when to commit the least recently used page to the physical
> disk.
Mike and I weren't discussing OutStream, but it's true that each
OutStream object maintains its own 1k buffer. It would be nice to
ditch that, because there are clearly some useless copy ops -- but I
anticipate achieving only minor, incremental gains for that trouble.
What would be really great is if we could optimize term dictionaries
and sort caches for mmap.
> Unfortunately, I won't be able to back
> that up with code any time soon. :(
Well, that just means both communiques and code will emerge from me
more slowly.
Nevertheless, I've been pleased with how our design discussions have
gone. It was your insight that just solved the multi-field query
parsing problem which had been vexing me for a long time. Compiler
finally has a half-decent interface, and I feel pretty confident that
ANDQuery, ORQuery, etc. will code up well. Let's keep things going.
I just wish we could get you, Dave Balmain, Mike McCandless and myself
all together hashing out a file format in the same forum at the same
time.
Marvin Humphrey
Rectangular Research
http://www.rectangular.com/