[KinoSearch] rfc: faceted search

Marvin Humphrey marvin at rectangular.com
Tue Jul 29 15:44:25 PDT 2008




On Jul 29, 2008, at 4:33 AM, Andrew Bramble wrote:

> A FS class would trawl the index on startup (or even better during
> index time and store with the invindex,

Caching on startup is a known successful design (used by Solr), but  
storing with the invindex might work well, particularly if we can get  
mmap happening properly.

(Note to Nate: mmap is something I've been considering for term  
dictionaries as well, as it could potentially eliminate startup costs  
for sort fields or even index Lexicons.)

> there's an API for index overlays... right?!?)

The architecture is mostly there.  There's no API yet, but perhaps  
this project will provide the necessary nudge.  More at the bottom of  
this reply...

> and generate bitvectors for desired field(s)
> terms - storing a 1 in doc_num position for documents posessing the
> given term. For a field with 100 facets or terms  - you'd need 100 bit
> vectors of at most max_docs bits long.

Exactly.

> An FS query would wrap a regular KS query - AND'ing the query results
> with each term's cached bitvector to derive a count of documents
> within the wrapped query that posess that term ( 100 ANDs + 100 counts
> of bitvectors no greater than maxdoc bits ).

Yes.  Something like this:

    my $intersection = KinoSearch::Util::BitVector->new;
    while ( my ( $field_name, $field_set ) = each %cached_bit_sets ) ) {
       $intersection->copy($result_set);
       $intersection->and($field_set);
       $facet_counts{$field_name} = $intersection->count;
    }

> I have made a VERY naive implementation of this without glueing into
> KinoSearch XS/C, since I confess to _barely_ grokking charmony + XS +
> C beyond kindergarten level.

Let's just worry about rapid prototyping for now.

> My next goal would be to better understand KS internals so as to
>  * use KS BitVectors to replace scalars and vec

KinoSearch::Util::BitVector has now been exposed as a public class as  
of r3661.  The API is similar to that of Java's BitSet.

>  * make Facet::Counter into KSx::Search::FacetQuery to collect the >0
> scored results of a child query and count facets this way instead of
> using KS::Search::Hits

If all we wanted was the facet counts, we could start with application  
logic something like this...

    my $result_set = KinoSearch::Util::BitVector->new(
        capacity => $searcher->max_docs + 1,
    );
    my $bit_collector = KinoSearch::Search::HitCollector::BitCollector- 
 >new(
       bit_vector => $result_set,
    );
    $searcher->collect(
       query     => $query,
       collector => $bit_collector,
    );

... followed by the facet count code above.

HitCollector, BitCollector, and Searcher::collect have now been  
exposed as public APIs as of r3662.

I might suggest one other bullet point:we want to be thinking about  
how to scale this across a search cluster from the get-go.  We can get  
facet counts working pretty fast using the techniques above, but only  
for a single machine.  It becomes more challenging with multiple nodes  
because Searchable's collect() method can't be implemented to work  
efficiently over a remote connection.

Perhaps opening up InvIndexer and IndexReader is the key.  (We can  
assume that we're extending the index with new files, though that  
wouldn't strictly be necessary as the existing index files contain all  
the necessary information.)  Here's a rough sketch of how things might  
work at index-time:

   my $facet_writer = KSx::Facets2::FacetWriter->new( invindex =>  
$invindex );
   my $seg_writer = KinoSearch::Index::SegWriter->new(
       invindex => $invindex,
   );
   $seg_writer->add_writer($facet_writer);
   my $invindexer = KinoSearch::InvIndexer->(
      invindex   => $invindex,
      seg_writer => $seg_writer,
   );
   ...

And, at search-time:

   my $facet_reader = KSx::Facets2::FacetReader->new(
      invindex => $invindex,
      fields   => \@facet_fields,
   );
   my $reader = KinoSearch::Index::IndexReader->open(
      invindex => $invindex,
   );
   $reader->add_readers($facet_reader);
   my $searcher = KinoSearch::Searcher->new(
      reader => $reader,
   );
   ...

   while ( my $cgi = CGI::Fast->new ) {
      my $query = $queryparser->parse( $cgi->param('q') || '' );
      my $facet_query = KSx::Facets2::FacetQuery->new(
         query  => $query,
         fields => \@facet_fields,
      );
      my $hits = $searcher->search( query => $facet_query );
      ...
   }

The main idea is that the FacetQuery class would rely on an  
IndexReader having a FacetReader as one of its children.  The  
FacetReader would be responsible for holding the cached BitVectors.

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/


_______________________________________________
KinoSearch mailing list
KinoSearch at rectangular.com
http://www.rectangular.com/mailman/listinfo/kinosearch




More information about the kinosearch mailing list