[KinoSearch] get doc/query similarity
jack_tanner at yahoo.com
Sun Apr 27 13:49:15 PDT 2008
----- Original Message ----
> From: Marvin Humphrey <marvin at rectangular.com>
[snip discussion of retrieval of a doc via IndexReader or a Searcher]
OK, I grant that KS doesn't want to be an RDB. What seems like a good idea to me is a method for retrieving *one specific* doc from the index, given some key, that is cheaper than running a search.
> You can fake a retrieval by primary key something like this:
>
> package MySearcher;
> use base qw( KinoSearch::Searcher );
>
> sub im_feeling_lucky {
>     my ( $self, $key ) = @_;
>     # Match the key exactly against the primary key field.
>     my $term_query = KinoSearch::Search::TermQuery->new(
>         field => 'pri_key_field',
>         term  => $key,
>     );
>     # $self is the searcher here; ask for a single hit only.
>     my $hits = $self->search( query => $term_query, num_wanted => 1 );
>     return $hits->fetch_hit;
> }
>
> That will work if you have your own exterior mechanism for
> guaranteeing the uniqueness of a particular field during indexing.
There's nothing wrong with that API. Suppose we could extend it by telling KS that we do have an exterior mechanism for guaranteeing the uniqueness of pri_key_field. KS could then stop searching the moment it gets the first hit, and return that. If we fail in our uniqueness guarantee, well, it still returns only one hit, and we can't know deterministically which one.
Moreover, KS could have a special-cased optimization for searching that field, perhaps requiring some syntax restrictions on the value (numeric only? single token only?).
By the way, there seems to be a related mechanism for $invindexer->delete_docs_by_term().
> First, we need a way of obtaining document numbers from a search. The
> easiest way to make this happen is to expose get_doc_num for HitDoc.
> (There are other places as well, that's just the easiest and it would
> work for our purposes.)
But you just said you don't want to expose internal KS doc numbers?
> * What's a better name than DocVector? AnalyzedDoc?
Yes.
> * Should we store any other information besides the terms and
> their positions, start_offsets and end_offsets?
You probably want to make sure that term frequencies are accessible, as well as left/right positional context. I'm not making claims about how these should be stored, just that they're accessible efficiently.
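To illustrate what I mean by "accessible", here is a minimal sketch in plain Perl. It does not use any KS API; positions_by_term() and context() are hypothetical helpers, and the token list stands in for whatever the stored per-doc data ends up being. The point is only that term frequency and left/right context both fall out of terms plus positions.

```perl
use strict;
use warnings;

# Build term => [positions] from a token list (a stand-in for stored positions).
sub positions_by_term {
    my @tokens = @_;
    my %positions;
    push @{ $positions{ $tokens[$_] } }, $_ for 0 .. $#tokens;
    return \%positions;
}

# Return the tokens within $width positions to the left and right of $pos.
sub context {
    my ( $tokens, $pos, $width ) = @_;
    my $last = $#{$tokens};
    my $lo   = $pos - $width < 0     ? 0     : $pos - $width;
    my $hi   = $pos + $width > $last ? $last : $pos + $width;
    return ( [ @$tokens[ $lo .. $pos - 1 ] ], [ @$tokens[ $pos + 1 .. $hi ] ] );
}

my @tokens    = qw( the quick brown fox jumps over the lazy dog );
my $positions = positions_by_term(@tokens);

# Term frequency is just the number of recorded positions.
my $tf = scalar @{ $positions->{the} };
my ( $left, $right ) = context( \@tokens, $positions->{fox}[0], 2 );
print "tf(the)=$tf left=@$left right=@$right\n";
```

Context lookup here needs the doc's tokens in position order, which is an assumption about what the per-doc record would make available.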
One similarity metric that's useful to compute is doc-doc similarity over token or character n-grams. How would one do that in our brave new world?
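For concreteness, this is roughly the computation I have in mind, sketched in plain Perl with no KS calls. char_ngrams() and cosine() are hypothetical helpers; cosine over raw n-gram counts is just one reasonable choice of metric, not something KS prescribes.

```perl
use strict;
use warnings;

# Count overlapping character n-grams in a string.
sub char_ngrams {
    my ( $text, $n ) = @_;
    my %count;
    my $last_start = length($text) - $n;
    for my $i ( 0 .. $last_start ) {    # empty range if the text is shorter than $n
        $count{ substr( $text, $i, $n ) }++;
    }
    return \%count;
}

# Cosine similarity between two sparse count vectors stored as hashrefs.
sub cosine {
    my ( $vec_a, $vec_b ) = @_;
    my ( $dot, $norm_a, $norm_b ) = ( 0, 0, 0 );
    for my $gram ( keys %$vec_a ) {
        $norm_a += $vec_a->{$gram}**2;
        $dot += $vec_a->{$gram} * $vec_b->{$gram} if exists $vec_b->{$gram};
    }
    $norm_b += $_**2 for values %$vec_b;
    return 0 unless $norm_a && $norm_b;
    return $dot / sqrt( $norm_a * $norm_b );
}

my $sim = cosine( char_ngrams( 'kinosearch', 3 ), char_ngrams( 'kinosearcher', 3 ) );
printf "%.3f\n", $sim;
```

Token n-grams would work the same way: build the count hash from joined runs of adjacent tokens instead of substrings. The open question is how to get at the tokens or text of both docs cheaply.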
> * How should the data file be formatted?
That's a bit beyond my reach.
> Yeah, absolutely. It's the same way with Lucene, and KS scoring is
> directly derived from the Lucene scoring model. Lucene and KS only
> care about coarse relative ranking, so there are some adulterations
> and approximations in the similarity calculations.
Something that may be useful is a toggle to normalize the returned scores...
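On the caller's side, the simplest version of such a toggle is dividing every score by the top score. A minimal sketch, assuming the raw scores are already available as plain non-negative numbers (normalize_scores() is a hypothetical helper, not a KS method):

```perl
use strict;
use warnings;
use List::Util qw( max );

# Scale raw scores so the best hit scores 1.0 (assumes non-negative scores).
sub normalize_scores {
    my @scores = @_;
    return () unless @scores;
    my $top = max(@scores);
    return @scores unless $top > 0;    # nothing sensible to divide by
    return map { $_ / $top } @scores;
}

my @normalized = normalize_scores( 2.4, 1.2, 0.6 );
print join( ', ', @normalized ), "\n";
```

This only makes scores comparable within one result list. It does not make them comparable across different queries.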
Can one do pseudo-relevance feedback using KS? That is, run a search, get some hits, then use the hits as the terms for a new search. Optionally, loop over the hits and exclude unwanted docs before executing the new search.
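Here is the loop I mean, as a toy sketch in plain Perl. Nothing in it is KS API: %docs stands in for an index, toy_search() is a made-up scorer that counts matching terms, and feedback_search() is a hypothetical name for one round of feedback.

```perl
use strict;
use warnings;

# Toy corpus standing in for an index; doc ids and texts are made up.
my %docs = (
    1 => 'perl search engine library',
    2 => 'search engine index scoring',
    3 => 'cooking pasta recipes',
    4 => 'index scoring similarity model',
);

# Rank docs by how many query-term occurrences they contain (stand-in scorer).
sub toy_search {
    my ( $docs, @terms ) = @_;
    my %wanted = map { $_ => 1 } @terms;
    my %score;
    for my $id ( keys %$docs ) {
        my $hits = grep { $wanted{$_} } split ' ', $docs->{$id};
        $score{$id} = $hits if $hits;
    }
    return sort { $score{$b} <=> $score{$a} || $a <=> $b } keys %score;
}

# One round of feedback: search, drop excluded docs, expand the query with
# the most frequent new terms from the top hits, then search again.
sub feedback_search {
    my ( $docs, $query, $top_docs, $new_terms, $exclude ) = @_;
    my %skip = map { $_ => 1 } @{ $exclude || [] };
    my @hits = grep { !$skip{$_} } toy_search( $docs, @$query );
    splice @hits, $top_docs if @hits > $top_docs;

    my %in_query = map { $_ => 1 } @$query;
    my %freq;
    for my $id (@hits) {
        $freq{$_}++ for grep { !$in_query{$_} } split ' ', $docs->{$id};
    }
    my @ranked = sort { $freq{$b} <=> $freq{$a} || $a cmp $b } keys %freq;
    splice @ranked, $new_terms if @ranked > $new_terms;

    return toy_search( $docs, @$query, @ranked );
}

my @results = feedback_search( \%docs, ['search'], 2, 2, [] );
print "@results\n";
```

With KS, the missing piece is a cheap way to get the terms of each hit, which is where the DocVector discussion comes in.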