[KinoSearch] get doc/query similarity
Marvin Humphrey
marvin at rectangular.com
Tue Apr 15 13:11:11 PDT 2008
On Apr 15, 2008, at 12:19 PM, Nathan Kurz wrote:
> This is going to be a pretty expensive query, though, and depending on
> your usage patterns you might want to precompute these. Depending on
> the overlap of your documents and how heavily you make use of
> stop-words, presume you may have to sift through about half your
> corpus, either from disk or memory depending on your situation.
Right.
Gory details: KS can't do super-effective primary key query
optimization because of index data compression. It doesn't know
*where* in the big blob of postings data the bit relating to document
primary_key_id=23412 lies. It's smart enough to stop scanning when no
more docs can match, but it still has to scan each posting list up to
that point.
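
A minimal sketch of why the scan can't be skipped, assuming posting lists are stored as gaps between sorted doc ids (delta encoding). That is one common form of index compression, not necessarily the exact format KS uses, and the function names here are made up for illustration. With gaps, a doc id is only known after summing everything in front of it, so there is no way to jump straight to 23412; the scan can only stop once the running id reaches or passes the target.

```python
def delta_encode(doc_ids):
    """Store each sorted doc id as the gap from its predecessor."""
    prev = 0
    gaps = []
    for doc_id in doc_ids:
        gaps.append(doc_id - prev)
        prev = doc_id
    return gaps


def scan_for_doc(gaps, target):
    """Sequentially decode a gap list looking for `target`.

    Returns (found, entries_read). The scan stops as soon as the decoded
    doc id equals or exceeds the target, since ids only increase.
    """
    current = 0
    for entries_read, gap in enumerate(gaps, start=1):
        current += gap  # a doc id is only known after summing prior gaps
        if current == target:
            return True, entries_read
        if current > target:
            return False, entries_read  # no later doc can match
    return False, len(gaps)


# Hypothetical posting list for one term.
postings = delta_encode([3, 17, 250, 23412, 50000])
found, entries_read = scan_for_doc(postings, 23412)
```

Here the scan reads four of the five entries before it can confirm the match; the fifth is never decoded.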
> ps. Marvin --- the term-by-term approach might be a useful general
> optimization for a special purpose additive OrScorer.
Yeah, term-at-a-time scoring is great stuff, it's just that the
combining scorers in KS all need to go doc-at-a-time in order to
handle boolean constraints without blowing up.
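
To make the two strategies concrete, here is a rough sketch for a purely additive OR. The posting data and function names are hypothetical and this is plain Python, not KS code. Term-at-a-time walks one whole posting list after another and sums into per-doc accumulators; doc-at-a-time advances all the lists together and finishes each doc before moving to the next.

```python
import heapq
from collections import defaultdict
from itertools import groupby
from operator import itemgetter

# Hypothetical posting lists: term -> sorted list of (doc_id, score contribution).
POSTINGS = {
    "apple": [(1, 0.5), (4, 0.25), (7, 1.0)],
    "banana": [(4, 0.5), (7, 0.5), (9, 2.0)],
}


def or_term_at_a_time(terms):
    """Process each posting list in full, summing into per-doc accumulators."""
    accumulators = defaultdict(float)
    for term in terms:
        for doc_id, score in POSTINGS.get(term, []):
            accumulators[doc_id] += score
    return dict(accumulators)


def or_doc_at_a_time(terms):
    """Merge all posting lists by doc id and yield one finished doc at a time."""
    lists = [POSTINGS.get(term, []) for term in terms]
    merged = heapq.merge(*lists, key=itemgetter(0))
    for doc_id, group in groupby(merged, key=itemgetter(0)):
        yield doc_id, sum(score for _, score in group)


tat_scores = or_term_at_a_time(["apple", "banana"])
dat_scores = dict(or_doc_at_a_time(["apple", "banana"]))
```

Both produce the same scores for a plain OR. The difference is that the doc-at-a-time version has every match for a doc in hand before it emits that doc, which is the point where a per-doc boolean constraint could be checked; the term-at-a-time version needs an accumulator for every doc it has touched so far.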
I've been thinking about adding new public classes ORQuery, ANDQuery,
ANDNOTQuery and ANDORQuery. BooleanQuery would either be deprecated
or removed; the logic from the compilation phase of BooleanScorer's
first iteration would be moved to QueryParser.
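
For illustration only, a sketch of how such component classes might nest. These are hypothetical Python stand-ins that borrow the proposed names; they are not the KS API, they match on sets of doc ids without scoring, and ANDORQuery is left out because its semantics aren't spelled out here. TermQuery is included only as a leaf for the example.

```python
class TermQuery:
    """Leaf query: docs containing a single term (illustrative stand-in)."""

    def __init__(self, term):
        self.term = term

    def matches(self, index):
        return set(index.get(self.term, ()))


class ORQuery:
    """Matches docs that satisfy at least one child."""

    def __init__(self, *children):
        self.children = children

    def matches(self, index):
        result = set()
        for child in self.children:
            result |= child.matches(index)
        return result


class ANDQuery:
    """Matches docs that satisfy every child."""

    def __init__(self, *children):
        self.children = children

    def matches(self, index):
        sets = [child.matches(index) for child in self.children]
        return set.intersection(*sets) if sets else set()


class ANDNOTQuery:
    """Matches docs that satisfy `required` but not `excluded`."""

    def __init__(self, required, excluded):
        self.required = required
        self.excluded = excluded

    def matches(self, index):
        return self.required.matches(index) - self.excluded.matches(index)


# Hypothetical index: term -> doc ids.
INDEX = {"apple": [1, 4, 7], "banana": [4, 7, 9], "cherry": [7]}

# (apple AND banana) AND NOT cherry
query = ANDNOTQuery(
    ANDQuery(TermQuery("apple"), TermQuery("banana")),
    TermQuery("cherry"),
)
hits = query.matches(INDEX)
```

Each class carries exactly one boolean operation, so a parser could build the tree directly from the query string rather than translating it into SHOULD/MUST/MUST_NOT clauses.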
Historical note: the original BooleanScorer, used in KS maint, is not
a wrapper around other combining scorers, it's an altogether different
beast. The SHOULD/MUST/MUST_NOT BooleanClause API fits that
implementation, but is, uh, less boolean than the modern
component-based approach allows.
Would that help with A) a term-at-a-time ORScorer, and B) your
subclasses?
Marvin Humphrey
Rectangular Research
http://www.rectangular.com/
_______________________________________________
KinoSearch mailing list
KinoSearch at rectangular.com
http://www.rectangular.com/mailman/listinfo/kinosearch