[KinoSearch] get doc/query similarity

Marvin Humphrey marvin at rectangular.com
Tue Apr 15 13:11:11 PDT 2008




On Apr 15, 2008, at 12:19 PM, Nathan Kurz wrote:

> This is going to be a pretty expensive query, though, and depending on
> your usage patterns you might want to precompute these.  Depending on
> the overlap of your documents and how heavily you make use of
> stop-words, I presume you may have to sift through about half your
> corpus, either from disk or memory depending on your situation.

Right.

Gory details: KS can't do very effective primary key query
optimization because the index data is compressed.  It doesn't know
*where* in the big blob of postings data the bit relating to document
primary_key_id=23412 lies.  It's smart enough to stop scanning once no
more docs can match, but it still has to scan each posting list up to
that point.
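
To make the scanning cost concrete, here is a minimal sketch in Python.
It is not KS code.  It assumes a posting list stored as gaps between
sorted doc ids, which is one common compression scheme; the helper
names (encode_deltas, decode, contains) are made up for illustration.
Because each doc id is only recoverable by summing the gaps before it,
there is no way to jump straight to a given document, but the scan can
stop as soon as the decoded ids pass the target.

```python
from typing import Iterable, Iterator, List, Tuple


def encode_deltas(doc_ids: Iterable[int]) -> List[int]:
    """Store a sorted posting list as gaps between successive doc ids."""
    deltas, prev = [], 0
    for doc_id in doc_ids:
        deltas.append(doc_id - prev)
        prev = doc_id
    return deltas


def decode(deltas: Iterable[int]) -> Iterator[int]:
    """Rebuild doc ids one at a time; each id depends on all earlier gaps."""
    doc_id = 0
    for gap in deltas:
        doc_id += gap
        yield doc_id


def contains(deltas: List[int], target: int) -> Tuple[bool, int]:
    """Return (found, number of postings decoded before we could stop)."""
    decoded = 0
    for doc_id in decode(deltas):
        decoded += 1
        if doc_id == target:
            return True, decoded
        if doc_id > target:
            # Ids only grow, so no later posting can match: stop early.
            return False, decoded
    return False, decoded


postings = encode_deltas([3, 17, 40, 23412, 50000])
print(contains(postings, 23412))  # (True, 4)
print(contains(postings, 41))     # (False, 4)
```

Both lookups decode four postings: everything up to the target has to
be walked, and only the tail of the list is skipped.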

> ps.  Marvin --- the term-by-term approach might be a useful general
> optimization for a special purpose additive OrScorer.

Yeah, term-at-a-time scoring is great stuff, it's just that the  
combining scorers in KS all need to go doc-at-a-time in order to  
handle boolean constraints without blowing up.
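
For anyone following along, a rough sketch of the two strategies for a
purely additive OR, in Python rather than anything resembling the KS
internals.  The toy INDEX and both function names are hypothetical.
Term-at-a-time walks one posting list after another and keeps an
accumulator per candidate doc; doc-at-a-time merges the lists by doc id
and finishes each document before moving to the next.

```python
import heapq
from collections import defaultdict
from typing import Dict, List

# Hypothetical toy index: term -> postings sorted by doc id,
# each posting being (doc_id, score contribution).
INDEX = {
    "apple": [(1, 0.5), (3, 1.0), (7, 0.2)],
    "banana": [(3, 0.4), (5, 0.9)],
    "cherry": [(1, 0.1), (7, 0.3)],
}


def or_term_at_a_time(terms: List[str]) -> Dict[int, float]:
    """Process one whole posting list at a time, summing into accumulators."""
    acc: Dict[int, float] = defaultdict(float)
    for term in terms:
        for doc_id, weight in INDEX.get(term, []):
            acc[doc_id] += weight
    return dict(acc)


def or_doc_at_a_time(terms: List[str]) -> Dict[int, float]:
    """Merge all posting lists by doc id and finish one doc before the next."""
    merged = heapq.merge(*(INDEX.get(term, []) for term in terms))
    scores: Dict[int, float] = {}
    current, total = None, 0.0
    for doc_id, weight in merged:
        if doc_id != current:
            if current is not None:
                scores[current] = total  # previous doc is complete here
            current, total = doc_id, 0.0
        total += weight
    if current is not None:
        scores[current] = total
    return scores


query = ["apple", "banana", "cherry"]
print(sorted(or_term_at_a_time(query)))  # [1, 3, 5, 7]
print(sorted(or_doc_at_a_time(query)))   # [1, 3, 5, 7]
```

In the doc-at-a-time version, everything known about a document is in
hand at the moment it is completed, which is the natural place to apply
a required or prohibited clause.  The term-at-a-time version has to
carry an accumulator for every candidate until the last list is done.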

I've been thinking about adding new public classes ORQuery, ANDQuery,  
ANDNOTQuery and ANDORQuery.  BooleanQuery would either be deprecated  
or removed; the logic from the compilation phase of BooleanScorer's  
first iteration would be moved to QueryParser.

Historical note: the original BooleanScorer, used in KS maint, is not  
a wrapper around other combining scorers, it's an altogether different  
beast.  The SHOULD/MUST/MUST_NOT BooleanClause API fits that  
implementation, but is, uh, less boolean than the modern
component-based approach allows.
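
A sketch of what moving the compilation step out of the scorer might
look like, in Python and with loud caveats: none of this is existing KS
API.  The node classes only borrow the proposed names, compile_clauses
is hypothetical, and the reading of ANDOR as "required part plus
optional part that only adds to the score" is an assumption, not
something settled in this thread.

```python
from dataclasses import dataclass
from typing import List, Tuple, Union


@dataclass(frozen=True)
class Term:
    text: str


@dataclass(frozen=True)
class OR:
    children: Tuple["Node", ...]


@dataclass(frozen=True)
class AND:
    children: Tuple["Node", ...]


@dataclass(frozen=True)
class ANDNOT:
    include: "Node"
    exclude: "Node"


@dataclass(frozen=True)
class ANDOR:  # assumed meaning: `required` must match, `optional` may add score
    required: "Node"
    optional: "Node"


Node = Union[Term, OR, AND, ANDNOT, ANDOR]


def _group(cls, nodes: Tuple[Node, ...]) -> Node:
    """Skip the wrapper when there is only a single child."""
    return nodes[0] if len(nodes) == 1 else cls(nodes)


def compile_clauses(clauses: List[Tuple[str, Node]]) -> Node:
    """Turn flat SHOULD/MUST/MUST_NOT clauses into a nested component tree."""
    must = tuple(n for occur, n in clauses if occur == "MUST")
    should = tuple(n for occur, n in clauses if occur == "SHOULD")
    must_not = tuple(n for occur, n in clauses if occur == "MUST_NOT")
    if must:
        tree = _group(AND, must)
        if should:
            tree = ANDOR(tree, _group(OR, should))
    elif should:
        tree = _group(OR, should)
    else:
        raise ValueError("need at least one MUST or SHOULD clause")
    if must_not:
        tree = ANDNOT(tree, _group(OR, must_not))
    return tree


tree = compile_clauses(
    [("MUST", Term("a")), ("SHOULD", Term("b")), ("MUST_NOT", Term("c"))]
)
print(tree)
```

The point of the sketch is only that the flat clause list collapses
into a small tree of single-purpose nodes, each of which could get its
own scorer.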

Would that help with A) a term-at-a-time ORScorer, and B) your  
subclasses?

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/

