[KinoSearch] Kino uses KinoSearch with Google n-grams...
Marvin Humphrey
marvin at rectangular.com
Thu Mar 15 08:41:32 PST 2007
On Mar 8, 2007, at 7:47 PM, Kino Coursey wrote:
> Seems I and the program are named after the same character from the
> same book!
Guess I'm not the only one it made a deep impression upon. :)
> It’s the +3,793 Million documents
That number of small documents shouldn't pose a problem. The number
of terms may be where we're running up against something. In theory,
the maximum number of terms per index should be somewhere just shy of
2**31, as they are tracked using a signed 32-bit integer. However,
there may be bottlenecks somewhere else I didn't think about. Maybe
there's some term-number arithmetic that wraps somewhere.
The way to hunt this down is to design an algorithm specifically for
maxing out unique terms and see where it chokes.
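
A minimal sketch of such a stress generator, in Python rather than anything from KinoSearch itself. The names `unique_terms` and `make_docs` are hypothetical; the only goal is to produce documents whose terms never repeat, so that the unique-term count climbs as fast as possible. Feeding the documents to an indexer is left out.

```python
import itertools
import string


def unique_terms(alphabet=string.ascii_lowercase):
    """Yield an endless stream of distinct terms: a, b, ..., z, aa, ab, ..."""
    for length in itertools.count(1):
        for letters in itertools.product(alphabet, repeat=length):
            yield "".join(letters)


def make_docs(terms_per_doc, num_docs):
    """Yield hypothetical document bodies; no term appears in two documents."""
    stream = unique_terms()
    for _ in range(num_docs):
        yield " ".join(itertools.islice(stream, terms_per_doc))


# Small demonstration; a real run would use far larger numbers.
docs = list(make_docs(terms_per_doc=5, num_docs=3))
for doc in docs:
    print(doc)
```

Scaling `terms_per_doc` and `num_docs` up until indexing fails should show roughly where the term count becomes a problem.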
[ ... investigates ... ]
Found one bottleneck. The loop iteration variable in
PostingsWriter's big finishing loop is a 32-bit integer. It
definitely ought to be a 64-bit integer, because it increments once
for each posting (one term, one doc, multiple positions). That
really needs to get fixed; however, it ought to result in the
exclusion of high-sorting terms (higher field number and term text
closer to 'z'), rather than cause a segfault.
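
To illustrate why the counter width matters, here is a rough Python sketch that uses `ctypes` to mimic storing a count in signed 32-bit and 64-bit slots. It is not KinoSearch code, and the posting count is a made-up figure. In real C, overflowing a signed integer is undefined behavior, so what a given loop actually does depends on the compiler and on how the loop is written.

```python
import ctypes

INT32_MAX = 2**31 - 1  # largest value a signed 32-bit integer can hold


def as_int32(n):
    """Return n as it reads back after being stored in a signed 32-bit slot."""
    return ctypes.c_int32(n).value


def as_int64(n):
    """Return n as it reads back after being stored in a signed 64-bit slot."""
    return ctypes.c_int64(n).value


total_postings = 3_000_000_000  # hypothetical count, larger than INT32_MAX

print(as_int32(INT32_MAX))       # 2147483647
print(as_int32(INT32_MAX + 1))   # -2147483648
print(as_int32(total_postings))  # -1294967296
print(as_int64(total_postings))  # 3000000000
```

A 64-bit counter holds the full count unchanged, while the 32-bit one cannot represent anything past 2,147,483,647.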
> I can run the query when the index is in the “S”’s just fine, but
> when I add the rest and finish : seg fault.
If you are on Linux and can spare the cycles, it would be interesting
to see what Valgrind has to say about this seg fault.
> One unusual thing that did happen during the indexing was a power
> failure.
I doubt that affected things. KinoSearch's indexing is robust in the
face of crashes. There's a moment when new data is committed via the
renaming of a file; if the indexing process stops before that,
the existing index is left unchanged.
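
The commit-by-rename idea can be sketched in a few lines of Python. This shows the general technique only, not KinoSearch's actual implementation; the function name `commit` and the file name `index.manifest` are hypothetical.

```python
import os
import tempfile


def commit(path, data):
    """Write data to a temp file, then rename it over path in one step."""
    directory = os.path.dirname(os.path.abspath(path))
    # Create the temp file in the same directory so the rename stays on one
    # filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()
            os.fsync(tmp.fileno())  # push the bytes to disk before renaming
        os.replace(tmp_path, path)  # the commit point
    except BaseException:
        # Anything that failed before the rename leaves the old file intact.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise


workdir = tempfile.mkdtemp()
manifest = os.path.join(workdir, "index.manifest")  # hypothetical file name
commit(manifest, b"generation 1")
commit(manifest, b"generation 2")
```

If the process dies while the temp file is being written, readers still see the previous version of the file, because nothing is visible under the final name until the rename succeeds.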
> Maybe a job for 0.20 ?
I would like to get this sorted before the official release of 0.20.
If, for some reason, accommodating a large number of terms (assuming
that is the issue) requires a backwards-incompatible change, I'd like
to bundle that change with all the others. I doubt that will be the
case, though. The architecture is derived from Lucene's, which has
been used to handle indexes in excess of 100 million documents (190
million is the largest I recall having heard about).
> Also one other option I am looking at is building the indexes in
> parallel and merging them into a unified index. I know it’s
> possible but will it be happy with the sizes I have to deal with?
>
> And in general are there recommended and absolute limits on the
> index size?
The architecture ought to withstand several million or possibly tens
of millions of docs on a single machine, depending on document size
and required response time. After that, it will be necessary to
spread out the index over multiple machines and combine search
results using MultiSearcher and SearchServer/SearchClient.
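
For a sense of what combining results across machines involves, here is a small Python sketch of merging ranked hit lists. It does not use or describe the MultiSearcher or SearchServer/SearchClient API; `merge_hits` and the sample data are hypothetical, and it assumes that scores from different machines are directly comparable, which a real distributed setup has to arrange for.

```python
import heapq
import itertools


def merge_hits(per_machine_hits, wanted):
    """Merge hit lists into one ranked list of at most `wanted` hits.

    Each input list holds (score, doc_id) tuples already sorted by
    descending score, as returned by one machine.
    """
    merged = heapq.merge(
        *per_machine_hits, key=lambda hit: hit[0], reverse=True
    )
    return list(itertools.islice(merged, wanted))


machine_a = [(0.9, "a1"), (0.4, "a2")]
machine_b = [(0.7, "b1"), (0.6, "b2")]
top_hits = merge_hits([machine_a, machine_b], wanted=3)
print(top_hits)  # [(0.9, 'a1'), (0.7, 'b1'), (0.6, 'b2')]
```

Each machine only needs to return its own top `wanted` hits, since no hit ranked lower than that locally can make the combined top `wanted`.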
Marvin Humphrey
Rectangular Research
http://www.rectangular.com/