[KinoSearch] get doc/query similarity

Marvin Humphrey marvin at rectangular.com
Wed Apr 23 21:06:39 PDT 2008




On Apr 18, 2008, at 6:22 AM, jack_tanner at yahoo.com wrote:

> Right. How about something like this:
>
> $doc1 = $invindex->get_doc(id_field => 'doc_id', id_value => $id1);
> $doc2 = $invindex->get_doc(id_field => 'doc_id', id_value => $id2);
>
> I like that this gets the doc from the invindex rather than a  
> searcher.

InvIndex is a low-level class.  (FYI, it's actually something  
different in maint and devel, but in both cases it's low-level).   
KinoSearch::Index::IndexReader, which has a private fetch_doc()  
method, more closely resembles what you're looking for.

   # private method
   my $doc = $reader->fetch_doc($doc_num);

Searchable also specs a fetch_doc() method which is implemented in  
Searcher as a call to $self->{reader}->fetch_doc().

However, those fetch_doc() methods operate on KinoSearch's internal  
document numbers. KS document numbers aren't presently a part of  
official API, and they change over time, which makes them both  
confusing and of limited use.

What you're talking about is adding some sort of retrieve-by-primary- 
key facility to KS.

> It makes clear that it returns a doc, not a hit.

FYI, in devel, $hits->fetch_hit() returns a HitDoc, which is a  
subclass of Doc.

> It either succeeds (we get *the* doc, not any other doc), or fails.

Well, this is really database territory, which isn't KinoSearch's  
element.  Adding primary key constraints is something that could  
potentially be done via a KSx subclass, but it would be very awkward  
in core.

You can fake a retrieval by primary key something like this:

    package MySearcher;
    use base qw( KinoSearch::Searcher );

    # Return the first hit whose 'pri_key_field' matches $key exactly.
    sub im_feeling_lucky {
       my ( $self, $key ) = @_;
       my $term_query = KinoSearch::Search::TermQuery->new(
          field => 'pri_key_field',
          term  => $key,
       );
       my $hits = $self->search( query => $term_query, num_wanted => 1 );
       return $hits->fetch_hit;
    }

That will work if you have your own exterior mechanism for  
guaranteeing the uniqueness of a particular field during indexing.
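
As a rough illustration of such an exterior mechanism, here is a minimal sketch in plain Perl.  UniqueKeyGuard and its claim() method are hypothetical names invented for this example and are not part of KinoSearch; the guard simply remembers the keys it has seen and rejects repeats before a document would be handed to the indexer.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of KinoSearch: remembers every primary key
# offered during one indexing session and refuses duplicates.
package UniqueKeyGuard;

sub new {
    my ($class) = @_;
    return bless { seen => {} }, $class;
}

# Returns true the first time a key is offered, false on any repeat.
sub claim {
    my ( $self, $key ) = @_;
    return 0 if exists $self->{seen}{$key};
    $self->{seen}{$key} = 1;
    return 1;
}

package main;

my $guard = UniqueKeyGuard->new;
my @docs  = (
    { pri_key_field => 'a1', content => 'first' },
    { pri_key_field => 'b2', content => 'second' },
    { pri_key_field => 'a1', content => 'duplicate of first' },
);

# Only documents whose key has not yet been claimed pass the filter.
my @accepted = grep { $guard->claim( $_->{pri_key_field} ) } @docs;
```

Note that this only covers a single indexing session.  Guarding against keys already present in an existing index would need a lookup against that index as well.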

> $similarity = $doc1->get_cosine($doc2);
>
> And more generally,
>
> $similarity = $doc1->get_similarity($doc2, $my_similarity_fxn);

Interesting.  Similarity measures are implemented using pluggable  
classes in KinoSearch, which suggests this...

   my $sim   = KinoSearch::Search::Similarity->new;
   my $score = $sim->cosine( $doc1, $doc2 );

Doc objects are just collections of stored fields, though.  They have  
no idea what terms they contain.  They have no idea how they're  
parsed, and a Similarity object wouldn't have any idea how to parse  
them either.

But here's where some fruitful possibilities arise.

Currently, KinoSearch writes a part of the index called "term  
vectors", for which Highlighter is the primary consumer.  The term  
vector information consists of lists of the terms present in each  
field, along with frequency, positions, start_offsets, and  
end_offsets.  KS accesses this information like so:

    # Fetch a DocVector object, from which TermVector objects may be
    # extracted.
    my $doc_vec = $searcher->fetch_doc_vec($doc_num);
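
To make the shape of that information concrete, here is a toy sketch in plain Perl.  The build_term_vectors() function, its hash layout, and the whitespace-and-lowercase "analyzer" are all assumptions made for illustration; they say nothing about how KinoSearch actually stores or formats term vectors.

```perl
use strict;
use warnings;

# Illustrative only: a toy "analyzer" that extracts runs of word characters
# and lowercases them.  The hash layout is an assumption, not KinoSearch's
# on-disk or in-memory format.
sub build_term_vectors {
    my ($text) = @_;
    my %vectors;
    my $position = 0;
    while ( $text =~ /(\w+)/g ) {
        my $term  = lc $1;
        my $entry = $vectors{$term} ||= {
            freq          => 0,
            positions     => [],
            start_offsets => [],
            end_offsets   => [],
        };
        $entry->{freq}++;
        push @{ $entry->{positions} },     $position++;   # token index
        push @{ $entry->{start_offsets} }, $-[1];         # offset of first char
        push @{ $entry->{end_offsets} },   $+[1];         # offset past last char
    }
    return \%vectors;
}

my $vectors = build_term_vectors('The cat saw the other cat');
```

Each term ends up with its frequency plus parallel arrays of token positions and character offsets, which is roughly the kind of data a highlighter needs in order to mark up the original text.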

The following cosine() method could theoretically work, because all  
the information it needs is at least present:

    my $score = $sim->cosine( $doc_vec1, $doc_vec2 );
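
For reference, the calculation itself is small.  The sketch below is plain Perl operating on term => frequency hashes; those hashes merely stand in for whatever a DocVector would expose, and the standalone cosine() function is an assumption for illustration rather than an existing Similarity method.

```perl
use strict;
use warnings;

# Cosine similarity between two term => frequency hashes.  The plain hashes
# stand in for whatever a DocVector would expose; that mapping is assumed.
sub cosine {
    my ( $freqs_a, $freqs_b ) = @_;
    my ( $dot, $norm_a, $norm_b ) = ( 0, 0, 0 );
    for my $term ( keys %$freqs_a ) {
        $norm_a += $freqs_a->{$term}**2;
        $dot    += $freqs_a->{$term} * $freqs_b->{$term}
            if exists $freqs_b->{$term};
    }
    $norm_b += $_**2 for values %$freqs_b;

    # An empty vector has no direction; treat the similarity as zero.
    return 0 if $norm_a == 0 || $norm_b == 0;
    return $dot / sqrt( $norm_a * $norm_b );
}

my $score = cosine( { cat => 2, dog => 1 }, { cat => 1, fish => 3 } );
```

This version weights terms by raw frequency only.  Unlike the asymmetric query-versus-document scoring, it is symmetric and always falls between 0 and 1 for non-negative frequencies.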

However, we'd need to expose a few more public APIs.

First, we need a way of obtaining document numbers from a search.  The  
easiest way to make this happen is to expose get_doc_num for HitDoc.   
(There are other places as well; that's just the easiest, and it would  
work for our purposes.)

Second, we need to expose DocVector, or rather an improvement upon  
DocVector, because DocVector isn't ready for prime time.

What Highlighter and you really need is a pre-analyzed document.   
(Highlighter could actually work by analyzing fields on the fly --  
indeed, Lucene's highlighter can be set up that way -- except for the  
fact that analyzing on the fly can be unacceptably slow for large  
documents or costly analyzers.)  The questions are...

   * What's a better name than DocVector?  AnalyzedDoc?
   * Should we store any other information besides the terms and
     their positions, start_offsets and end_offsets?
   * How should the data file be formatted?

This is something I really want to nail in the file format, because  
that's the hardest thing to change.

> - KS retrieval is asymmetrical (and that's fine). Let  
> similarity(I,A,B) be a function that specifies document A as query  
> against index I, iterates over the hits until it gets to document B,  
> and returns the score of document B. Then similarity(I,A,B) !=  
> similarity(I,B,A). I handled this by retrieving both  
> similarity(I,A,B) and similarity(I,B,A) and taking the average.
>
> - One issue that still puzzles me is that KS is apparently capable  
> of a hit score greater than 1! Is that really true?

Yeah, absolutely.  It's the same way with Lucene, and KS scoring is  
directly derived from the Lucene scoring model.  Lucene and KS only  
care about coarse relative ranking, so there are some adulterations  
and approximations in the similarity calculations.
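
For what it's worth, the averaging workaround described in the quoted message might be sketched as follows.  Everything here is hypothetical: $score_of is a stand-in callback that would run one document as a query and return the other document's hit score, and the fake scores exist only so the example runs on its own.

```perl
use strict;
use warnings;

# $score_of is a hypothetical callback: given (query_doc_id, target_doc_id),
# it runs the first doc as a query and returns the target's hit score,
# or undef if the target never appears among the hits.
sub symmetric_similarity {
    my ( $score_of, $id_a, $id_b ) = @_;
    my $a_to_b = $score_of->( $id_a, $id_b );
    my $b_to_a = $score_of->( $id_b, $id_a );

    # Assumption: a target that is never hit counts as a score of zero.
    $a_to_b = 0 unless defined $a_to_b;
    $b_to_a = 0 unless defined $b_to_a;
    return ( $a_to_b + $b_to_a ) / 2;
}

# Stand-in scores for demonstration; real values would come from searches.
my %fake_scores = ( 'A>B' => 1.4, 'B>A' => 0.6 );
my $score_of    = sub { $fake_scores{"$_[0]>$_[1]"} };

my $sim_ab = symmetric_similarity( $score_of, 'A', 'B' );
my $sim_ba = symmetric_similarity( $score_of, 'B', 'A' );
```

Because individual hit scores can exceed 1, the average can too; it is only meaningful for comparing pairs scored against the same index.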

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/

