[KinoSearch] Kino uses KinoSearch with Google n-grams...

Marvin Humphrey marvin at rectangular.com
Thu Mar 15 08:41:32 PST 2007


On Mar 8, 2007, at 7:47 PM, Kino Coursey wrote:
> Seems I and the program are named after the same character from the  
> same book!
Guess I'm not the only one it made a deep impression upon.  :)
> It’s the +3,793 Million documents
That number of small documents shouldn't pose a problem.  The number
of unique terms may be where we're hitting a limit.  In theory, the
maximum number of terms per index should be just shy of 2**31, as
they are tracked using a signed 32-bit integer.  However, there may
be bottlenecks elsewhere that I didn't think about.  Maybe there's
some term-number arithmetic that wraps somewhere.

The way to hunt this down is to design an algorithm specifically for  
maxing out unique terms and see where it chokes.
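
As an illustration of that kind of stress test, here is a minimal
Python sketch (not KinoSearch code) that generates any number of
distinct terms by counting in base 26.  The function name
`unique_terms` and the fixed term width are assumptions made for the
example; feeding each term to an indexer is left out.

```python
import string

ALPHABET = string.ascii_lowercase


def unique_terms(count, width=7):
    """Yield `count` distinct lowercase terms of fixed `width`, in sorted order.

    Each term is the base-26 representation of a counter, left-padded
    with 'a', so no term is ever produced twice.
    """
    if count > len(ALPHABET) ** width:
        raise ValueError("width too small for requested number of terms")
    for n in range(count):
        value = n
        chars = []
        for _ in range(width):
            value, rem = divmod(value, len(ALPHABET))
            chars.append(ALPHABET[rem])
        # Digits were collected least-significant first, so reverse them.
        yield "".join(reversed(chars))
```

A width of 7 allows 26**7 distinct terms, which is more than 2**31,
so the generator can run past the theoretical limit discussed above.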

[ ... investigates ... ]

Found one bottleneck.  The loop iteration variable in  
PostingsWriter's big finishing loop is a 32-bit integer.  It  
definitely ought to be a 64-bit integer, because it increments once  
for each posting (one term, one doc, multiple positions).  That
really needs to get fixed; however, it ought to result in the
exclusion of high-sorting terms (higher field number and term text
closer to 'z'), rather than cause a segfault.
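
To make the arithmetic concrete, here is a small Python sketch that
emulates signed 32-bit and 64-bit counters with `ctypes`.  It is not
PostingsWriter's actual loop, and the helper names `as_int32` and
`as_int64` are hypothetical.  Note that in C, overflowing a signed
integer is undefined behavior; the wraparound shown here is only what
typically happens on two's-complement hardware.

```python
import ctypes

INT32_MAX = 2**31 - 1  # largest value a signed 32-bit integer can hold


def as_int32(value):
    """Reinterpret a Python int as a signed 32-bit integer (wraps silently)."""
    return ctypes.c_int32(value).value


def as_int64(value):
    """Reinterpret a Python int as a signed 64-bit integer."""
    return ctypes.c_int64(value).value
```

Under these assumptions, `as_int32(INT32_MAX + 1)` gives -2147483648,
while `as_int64(INT32_MAX + 1)` gives 2147483648.  A count of
3,793 million, the figure quoted above, comes out as -501967296 in
32 bits.
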
> I can run the query when the index is in the “S”’s just fine, but  
> when I add the rest and finish : seg fault.
If you are on Linux and can spare the cycles, it would be interesting  
to see what Valgrind has to say about this seg fault.
> One unusual thing that did happen during the indexing was a power  
> failure.
I doubt that affected things.  KinoSearch's indexing is robust in the
face of crashes.  New data is committed at a single moment, via the
renaming of a file; if the indexing process stops before that rename,
the existing index is left unchanged.
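
The commit-by-rename idea can be sketched in a few lines of Python.
This is a generic illustration, not KinoSearch's code: the function
name `commit_file` is hypothetical, and the sketch relies on
`os.replace`, which is atomic on POSIX systems.

```python
import os
import tempfile


def commit_file(path, data):
    """Write `data` (bytes) to a temporary file, then rename it over `path`."""
    directory = os.path.dirname(os.path.abspath(path))
    # Create the temp file in the same directory so the rename never
    # has to cross a filesystem boundary.
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()
            os.fsync(tmp.fileno())  # push the bytes to disk before committing
        os.replace(tmp_path, path)  # the commit point
    except BaseException:
        # A failure before the rename leaves the old file untouched.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

Readers of `path` see either the complete old contents or the
complete new contents, never a half-written file.
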
> Maybe a job for 0.20 ?
I would like to get this sorted before the official release of 0.20.   
If, for some reason, accommodating a large number of terms (assuming  
that is the issue) requires a backwards-incompatible change, I'd like  
to bundle that change with all the others.  I doubt that will be the  
case, though.  The architecture is derived from Lucene's, which has  
been used to handle indexes in excess of 100 million documents (190  
million is the largest I recall having heard about).

> Also one other option I am looking at is building the indexes in  
> parallel and merging them into a unified index. I know it’s  
> possible but will it be happy with the sizes I have to deal with?
>
> And in general are there recommended and absolute limits on the  
> index size?
The architecture ought to withstand several million or possibly tens  
of millions of docs on a single machine, depending on document size  
and required response time.  After that, it will be necessary to  
spread out the index over multiple machines and combine search  
results using MultiSearcher and SearchServer/SearchClient.
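
How results from several machines get combined can be illustrated
with a generic merge of per-machine hit lists.  This Python sketch
does not use or reproduce the MultiSearcher or
SearchServer/SearchClient APIs; `merge_hits` and the (score, doc_id)
tuple layout are assumptions for the example, and it presumes that
scores from different machines are directly comparable.

```python
import heapq
from itertools import islice


def merge_hits(shard_results, limit=10):
    """Merge per-shard hit lists into one list of the top `limit` hits.

    Each element of `shard_results` is a list of (score, doc_id) tuples
    already sorted by descending score.
    """
    merged = heapq.merge(*shard_results, key=lambda hit: hit[0], reverse=True)
    return list(islice(merged, limit))
```

Because each input list is already sorted, the merge only has to look
at as many hits as it returns.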

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/




