[KinoSearch] more abstract interfaces to kinosearch
Hans Dieter Pearcey
hdp at pobox.com
Mon Jul 2 14:22:04 PDT 2007
On Mon, Jul 02, 2007 at 01:35:45PM -0700, Marvin Humphrey wrote:
> >Is this true even when (like me) you are only interested in matching?
>
> In theory the (unfinished) MatchPosting class is supposed to help out
> with situations like yours. However, because it doesn't store token
> position, it doesn't support phrase matching, and maybe it needs to
> be rethought.
I *think* this meant "no, SHOULD won't help you", but I can't tell. Is that
true? :)
> No, there's no reason. I guess I thought the capabilities were
> implied by the class name. Looks like usability testing has revealed
> a flaw! ;)
BooleanQuery implies OR, but SHOULD doesn't (to me). I found this fairly
frustrating specifically *because* the class name implied that I could use it to
do OR, but I can't, because I don't want to sort by score.
> There are a lot of good databases out there. KS shouldn't aspire to compete
> with PostgreSQL.
No, and my use case is actually augmenting PostgreSQL. Pg (at least with the
hardware I have) doesn't search people's email fast enough. It is good at
things like referential integrity and constraints on the data. It isn't good
at looking through 250 GB of mail in under a second.
> KinoSearch is always going to be optimized for the use case of a
> large number of queries against a single view of an index.
I don't think I'm asking that it be anything else, and I think that is a
perfectly sensible optimization to make.
> I don't think we'll have to make a choice between matching alone and
> matching with scoring, though. It should be possible to support both
> without compromise.
That's sort of my point. Right now, it seems difficult to me to get matching
alone, because the documentation isn't there to explain to me how to get the
various different behaviors I want.
When I asked about RangeFilter and how to get the same behavior, but as a
Query, you said "Oh, you could plug X and Y and Z together, but that wouldn't
get you scoring". What I took away from that was that the pieces were there to
do what I wanted, but I didn't know it because a) I don't know anything about
IR theory and b) the documentation didn't tell me so, because it seems to
assume that you want scoring. Does that seem unfair?
Here is how I (naively) think the classes might make more sense:
* A class per document selector (term, phrase, range, boolean, ...)
* A 'Query' class combining selector + whatever makes up Query behavior
  (scoring?)
* A 'Filter' class combining selector + whatever makes up Filter behavior
  (bit vector caching?)
* explanations in specific selectors about why they can or can't be used for
  either Query or Filter: Range can't be used with Query because there's no
  way to know how it should contribute to scoring
* documentation on how to write new selectors, and on what things make a
  particular selector suitable for Query or Filter
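
To make the list above concrete, here is a rough sketch of that split. It is
in Python rather than Perl, and none of it is KinoSearch API: Selector,
TermSelector, RangeSelector, OrSelector, the "scorable" flag, and the toy
term-count score are all made up for illustration. Documents are plain dicts,
and a frozenset of doc ids stands in for a cached bit vector.

```python
from abc import ABC, abstractmethod


class Selector(ABC):
    """Hypothetical base class: decides only whether a document matches."""

    scorable = False  # assumption: selectors opt in to scoring explicitly

    @abstractmethod
    def matches(self, doc):
        ...


class TermSelector(Selector):
    scorable = True

    def __init__(self, field, term):
        self.field, self.term = field, term

    def matches(self, doc):
        return self.term in doc.get(self.field, "").split()

    def score(self, doc):
        # Toy score: raw term count, not a real IR weighting.
        return doc.get(self.field, "").split().count(self.term)


class RangeSelector(Selector):
    # Not scorable: nothing says how a range should contribute to a score.
    def __init__(self, field, lower, upper):
        self.field, self.lower, self.upper = field, lower, upper

    def matches(self, doc):
        value = doc.get(self.field)
        return value is not None and self.lower <= value <= self.upper


class OrSelector(Selector):
    # Matching-only OR; left unscorable here to keep the sketch small.
    def __init__(self, *children):
        self.children = children

    def matches(self, doc):
        return any(child.matches(doc) for child in self.children)


class Filter:
    """Selector + caching. Assumes a single, unchanging view of the docs."""

    def __init__(self, selector):
        self.selector = selector
        self._cache = None

    def doc_ids(self, docs):
        if self._cache is None:
            self._cache = frozenset(
                i for i, doc in enumerate(docs) if self.selector.matches(doc)
            )
        return self._cache


class Query:
    """Selector + scoring. Refuses selectors that cannot be scored."""

    def __init__(self, selector):
        if not selector.scorable:
            raise TypeError(f"{type(selector).__name__} cannot be scored")
        self.selector = selector

    def hits(self, docs):
        scored = [
            (i, self.selector.score(doc))
            for i, doc in enumerate(docs)
            if self.selector.matches(doc)
        ]
        return sorted(scored, key=lambda hit: hit[1], reverse=True)


docs = [
    {"body": "kino search kino", "size": 5},
    {"body": "postgres", "size": 50},
    {"body": "mail", "size": 15},
    {"body": "kino", "size": 100},
]
either = OrSelector(TermSelector("body", "kino"), RangeSelector("size", 10, 20))
matching_ids = Filter(either).doc_ids(docs)            # matching only, no scores
ranked = Query(TermSelector("body", "kino")).hits(docs)  # matching plus scores
```

In this sketch, wrapping RangeSelector in Query raises a TypeError, which is
the kind of "why this can't be a Query" explanation the fourth bullet asks
for, while the same selector works fine inside Filter.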
I can try to explain this better later when I'm not late for dinner. :)
hdp.