[KinoSearch] get doc/query similarity
Marvin Humphrey
marvin at rectangular.com
Mon Apr 28 14:11:44 PDT 2008
On Apr 27, 2008, at 1:49 PM, jack_tanner at yahoo.com wrote:
> What seems to me to be a good idea is a method of retrieving *one
> specific* doc from the index, given some key, that is relatively
> cheaper than doing a search.
Here's the slightly more optimized version -- which, unlike the
earlier version, will work with a Searcher but not a MultiSearcher.
Also, it returns the first hit rather than the highest scoring hit.
package MySearcher;
use base qw( KinoSearch::Searcher );
sub single_hit {
my ( $self, $key ) = @_;
my $reader = $self->get_reader;
my $posting_list = $reader->posting_list(
field => 'pri_key_field',
term => $key,
);
return unless $posting_list;
my $doc_num = $posting_list->next;
return unless $doc_num;
return $reader->fetch_doc($doc_num);
}
If the term is unique, I doubt that the there will be significant
performance differences between the two implementations. The i/o
costs are exactly the same -- the searching technique just involves a
few more wrappers.
The more important difference is that you can get a PostingList from
an IndexReader, but not from a Searcher. PostingLists, like
IndexReaders, are tied to individual machines. Searcher is a single-
machine subclass of Searchable; and Searchable's interface is designed
to work with document collections which may or may not reside on a
single machine.
I realize that sounds kind of complicated, but the point is that we
have to balance two competing design goals with Searcher. On one hand,
it's a single entrance point in a large, flexible retrieval API -- and
it has to remain a responsible role player within the larger system.
On the other hand, it's the most popular entrance point, so it has to
be easy to use without needing to grok the whole system.
> Suppose we could extend that by telling KS that we indeed have an
> exterior mechanism of guaranteeing the uniqueness of pri_key_field.
> It could then stop searching the moment it gets the first hit, and
> return that.
As soon as the Searcher exhausts the posting list, scoring stops. So
the cost to score all hits actually depends on whether you
successfully enforce uniqueness from outside. :)
> Moreover, KS could have a special-cased optimization for searching
> that field, perhaps requireing some syntax restrictions on the value
> (numeric only? single token only?).
You'll need to set the field to be unanalyzed; otherwise you have add
an Analyzer into the mix.
> By the way, there seems to be a related mechanism for $invindexer-
> >delete_docs_by_term().
InvIndexer is always operating on a single machine.
Hmm. Maybe InvIndexer should be a subclass of a more general abstract
class defining an indexing interface which isn't necessarily tied to
one machine -- i.e. an index-time counterpart to Searchable.
>> First, we need a way of obtaining document numbers from a search.
>> The
>> easiest way to make this happen is to expose get_doc_num for HitDoc.
>> (There are other places as well, that's just the easiest and it would
>> work for our purposes.)
>
> But you just said you don't want to expose internal KS doc numbers?
I've been trying to keep them hidden, and we've gotten pretty far
without them.
But it was inevitable that doc numbers would get exposed eventually,
because they are needed by many of the low-level, expert components
we're starting to expose: PostingList, HitCollector, etc.
The downside of exposing KinoSearch's document numbers is that their
behavior isn't intuitive: they're consistent for the life of a
Searchable or IndexReader, but outside of that, they sometimes change.
>> * Should we store any other information besides the terms and
>> their positions, start_offsets and end_offsets?
>
> You probably want to make sure that term frequencies are accessible,
Good thought. They're already in there, but I had neglected to
mention it.
> as well as left/right positional context.
I think you're just using different names for what I'm calling
"start_offset" and "end_offset", which are measured in Unicode code
points from the top of the field.
> One similarity metric that's useful to compute is doc-doc similarity
> over token or character n-grams. How would one do that in our brave
> new world?
I think you'd want to actually index the n-grams by specifying some
sort of n-gram Analyzer subclass, then use the same techniques we've
been discussing.
> Something that may be useful is a toggle to normalize the returned
> scores...
They did that with the Lucene version of Hits and it seems to have
resulted in a lot of confusion. I think it's misleading, because it
gives the illusion of an absolute similarity measure but not the
reality.
You can always normalize scores yourself off of the score of the first
HitDoc returned by the Hits object. (Though that technique doesn't
work with sorted results because the first hit probably isn't the
highest scoring.)
> Can one do pseudo-relevance feedback using KS? That is, run a
> search, get some hits, then use the hits as the terms for a new
> search. Optionally, loop over the hits and exclude unwanted docs
> before executing the new search.
You can do that, but I've never been impressed with how "more like
this" queries behave using vanilla TF-IDF. Rare terms in the feedback
documents which are unrelated to the original query -- especially
proper names -- tend to spawn high-scoring irrelevant results.
To make such a feedback system work well, I think you'd need be
working with a decomposed term space a la LSA and apply the term space
of the original query as a filter on the term space of each feedback
doc.
Marvin Humphrey
Rectangular Research
http://www.rectangular.com/
_______________________________________________
KinoSearch mailing list
KinoSearch at rectangular.com
http://www.rectangular.com/mailman/listinfo/kinosearch
More information about the kinosearch
mailing list