[KinoSearch] rfc: faceted search

Marvin Humphrey marvin at rectangular.com
Wed Jul 30 12:38:46 PDT 2008




On Jul 30, 2008, at 11:03 AM, Nathan Kurz wrote:

> The bit vectors become
> unworkable once you have a large number of facets (think of using
> 'author' as a facet on a large dataset),

If you anticipate needing a large number of facets with small data  
sets, that's where you don't use BitVectors any more.  Instead you use  
a short array of sorted integers, possibly compressed -- I think Solr  
uses a class called SortedVIntList for this.  If we added such a  
feature, we'd need to create methods to and() and or() BitVectors  
against these integer lists instead of against other BitVectors.

That doesn't solve all our problems, but it does address the issue of  
sparse BitVectors using too much memory.

> and the wrapped query
> approach doesn't spread well to a cluster (as one would be forced to
> round-trip the full list of matching docs over the net and not just
> the top-n).

You never need to communicate the full list of matching docs across  
the network.  All you need to communicate is the top-n matching docs  
and the facet counts, which isn't a problem.

How to pass those facet counts, I'm not exactly sure.  We might need  
to extend TopDocs or something like that.  Since Solr sits on top of  
Lucene, it's must have its own Searcher-style class which already  
knows how to pass facet counts.  I wouldn't want to add such  
capabilities to our core searching classes, though.  Instead, I'd like  
to make IndexReader and Searchable extensible so that they can handle  
arbitrary Query subclasses, such as FacetQueries.  We'll see how that  
works out. :)

> Instead the problem can be better scalably solved by reusing the
> existing inverted and document indexes.  Really the only change
> necessary is accepting that  facet counting needs to be done during
> the main query, and not as a later step like gathering excepts.

Caching result seets for each facet into BitVectors or what-have-you  
is important because matching a particular facet is likely to be an  
expensive query.  Here are the top facet counts for a search at Amazon  
for "moon":

     * Books (343,415)
     * MP3 Downloads (35,160)
     * Home & Garden (22,234)
     * Music (7,086)
     ...

We don't want to execute 'category:books', 'category:mp3_downloads',  
etc. with each search query.  Therefore, we have to cache the result  
sets for those facets.  We can do it at search-time like so:

   for my $category (qw( books mp3_downloads home_and_garden music )) {
     my $bit_vec = KinoSearch::Util::BitVector->new(
       capacity => $searcher->max_docs + 1,
     );
     my $bit_collector =  
KinoSearch::Search::HitCollector::BitCollector->new(
       bit_vector => $bit_vector,
     );
     my $cat_query = KinoSearch::Search::TermQuery->new(
       field => 'category',
       term  => $category,
     );
     $searcher->collect(
        query     => $cat_query,
        collector => $bit_collector,
     );
     $facet_caches{$category} = $bit_vec;
   }

We could also potentially perform the same sort of search at index- 
time and write out the result sets as e.g. bit-arrays.  That way, it  
wouldn't be necessary to execute a bunch of searches at Search-time  
startup, especially if we could set up special BitVectors where the  
bit arrays were mmap'd against those files.

If I understand correctly, we'd get an even bigger benefit from index- 
time preparation of facet result sets when there are several search  
processes, since the data for the facet result sets would be cached in  
shared system memory pages, instead of in per-process BitVector objects.

> Instead of wrapping a query, one just uses a FacetCollector.  Hit by
> hit, this collector (which could be wrapping a HitCollector) steps
> through the facet field's DocVector (is this where the forward index
> of terms is kept?) adding up term occurrences.

I think there's some confusion as to the role of DocVector.  Each  
DocVector is basically a self-contained single-document inverted  
index, and retrieving a DocVector is approximately as expensive as  
retrieving a Doc.  Using DocVector while iterating through posting  
lists isn't feasible; retrieving one DocVector for each of the top  
hits after you've finished searching is.

The only thing that uses DocVector right now is the Highlighter.   
DocVector will also come in handy if we get an Explain() method  
happening for Query/Compiler.

> These facet counts are then returned as part of the response along
> with the top matches.  Caching, if it is to occur, happens in a
> centralized way (memcached) at the top level for the entire query,
> rather than component by component within the cluster.

I can't quite wrap my head around that.  Can you please elaborate?

> ps.  I can't tell at a glance how KinoSearch currently treats fields
> in its indexes.  For example, does each field get its own lexicon?

Yes, right now that is the case.  In every segment, each field gets  
its own .lex and .lexx "files
".  When there are a lot of fields, that means there are a huge number  
of such "files", but since KS uses a virtual file system and all those  
"files" are actually contained within a single .cf file per segment,  
we don't run out of descriptors.  Peek at the JSON inside a .cfmeta  
file to get an idea of what's going on.

Per-field lexicon files are not perfectly satisfactory, but explaining  
all the reasoning would require an extended digression.

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/


_______________________________________________
KinoSearch mailing list
KinoSearch at rectangular.com
http://www.rectangular.com/mailman/listinfo/kinosearch




More information about the kinosearch mailing list