[KinoSearch] get doc/query similarity

Marvin Humphrey marvin at rectangular.com
Mon Apr 28 14:11:44 PDT 2008




On Apr 27, 2008, at 1:49 PM, jack_tanner at yahoo.com wrote:

> What seems to me to be a good idea is a method of retrieving *one  
> specific* doc from the index, given some key, that is relatively  
> cheaper than doing a search.

Here's the slightly more optimized version -- which, unlike the  
earlier version, will work with a Searcher but not a MultiSearcher.   
Also, it returns the first hit rather than the highest scoring hit.

    package MySearcher;
    use base qw( KinoSearch::Searcher );

    sub single_hit {
       my ( $self, $key ) = @_;
       my $reader = $self->get_reader;
       my $posting_list = $reader->posting_list(
          field => 'pri_key_field',
          term  => $key,
       );
       return unless $posting_list;
       my $doc_num = $posting_list->next;
       return unless $doc_num;
       return $reader->fetch_doc($doc_num);
    }

If the term is unique, I doubt that the there will be significant  
performance differences between the two implementations.  The i/o  
costs are exactly the same -- the searching technique just involves a  
few more wrappers.

The more important difference is that you can get a PostingList from  
an IndexReader, but not from a Searcher. PostingLists, like  
IndexReaders, are tied to individual machines.  Searcher is a single- 
machine subclass of Searchable; and Searchable's interface is designed  
to work with document collections which may or may not reside on a  
single machine.

I realize that sounds kind of complicated, but the point is that we  
have to balance two competing design goals with Searcher. On one hand,  
it's a single entrance point in a large, flexible retrieval API -- and  
it has to remain a responsible role player within the larger system.   
On the other hand, it's the most popular entrance point, so it has to  
be easy to use without needing to grok the whole system.

> Suppose we could extend that by telling KS that we indeed have an  
> exterior mechanism of guaranteeing the uniqueness of pri_key_field.  
> It could then stop searching the moment it gets the first hit, and  
> return that.

As soon as the Searcher exhausts the posting list, scoring stops.  So  
the cost to score all hits actually depends on whether you  
successfully enforce uniqueness from outside.  :)

> Moreover, KS could have a special-cased optimization for searching  
> that field, perhaps requireing some syntax restrictions on the value  
> (numeric only? single token only?).

You'll need to set the field to be unanalyzed; otherwise you have add  
an Analyzer into the mix.

> By the way, there seems to be a related mechanism for $invindexer- 
> >delete_docs_by_term().

InvIndexer is always operating on a single machine.

Hmm.  Maybe InvIndexer should be a subclass of a more general abstract  
class defining an indexing interface which isn't necessarily tied to  
one machine -- i.e. an index-time counterpart to Searchable.

>> First, we need a way of obtaining document numbers from a search.   
>> The
>> easiest way to make this happen is to expose get_doc_num for HitDoc.
>> (There are other places as well, that's just the easiest and it would
>> work for our purposes.)
>
> But you just said you don't want to expose internal KS doc numbers?

I've been trying to keep them hidden, and we've gotten pretty far  
without them.

But it was inevitable that doc numbers would get exposed eventually,  
because they are needed by many of the low-level, expert components  
we're starting to expose: PostingList, HitCollector, etc.

The downside of exposing KinoSearch's document numbers is that their  
behavior isn't intuitive: they're consistent for the life of a  
Searchable or IndexReader, but outside of that, they sometimes change.

>>   * Should we store any other information besides the terms and
>>     their positions, start_offsets and end_offsets?
>
> You probably want to make sure that term frequencies are accessible,

Good thought.  They're already in there, but I had neglected to  
mention it.

> as well as left/right positional context.

I think you're just using different names for what I'm calling  
"start_offset" and "end_offset", which are measured in Unicode code  
points from the top of the field.

> One similarity metric that's useful to compute is doc-doc similarity  
> over token or character n-grams.  How would one do that in our brave  
> new world?

I think you'd want to actually index the n-grams by specifying some  
sort of n-gram Analyzer subclass, then use the same techniques we've  
been discussing.

> Something that may be useful is a toggle to normalize the returned  
> scores...


They did that with the Lucene version of Hits and it seems to have  
resulted in a lot of confusion.  I think it's misleading, because it  
gives the illusion of an absolute similarity measure but not the  
reality.

You can always normalize scores yourself off of the score of the first  
HitDoc returned by the Hits object.  (Though that technique doesn't  
work with sorted results because the first hit probably isn't the  
highest scoring.)

> Can one do pseudo-relevance feedback using KS? That is, run a  
> search, get some hits, then use the hits as the terms for a new  
> search. Optionally, loop over the hits and exclude unwanted docs  
> before executing the new search.


You can do that, but I've never been impressed with how "more like  
this" queries behave using vanilla TF-IDF.  Rare terms in the feedback  
documents which are unrelated to the original query -- especially  
proper names -- tend to spawn high-scoring irrelevant results.

To make such a feedback system work well, I think you'd need be  
working with a decomposed term space a la LSA and apply the term space  
of the original query as a filter on the term space of each feedback  
doc.

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/


_______________________________________________
KinoSearch mailing list
KinoSearch at rectangular.com
http://www.rectangular.com/mailman/listinfo/kinosearch




More information about the kinosearch mailing list