[KinoSearch] rfc: faceted search
Marvin Humphrey
marvin at rectangular.com
Tue Jul 29 15:44:25 PDT 2008
On Jul 29, 2008, at 4:33 AM, Andrew Bramble wrote:
> A FS class would trawl the index on startup (or even better during
> index time and store with the invindex,
Caching on startup is a known successful design (used by Solr), but
storing with the invindex might work well, particularly if we can get
mmap happening properly.
(Note to Nate: mmap is something I've been considering for term
dictionaries as well, as it could potentially eliminate startup costs
for sort fields or even index Lexicons.)
> there's an API for index overlays... right?!?)
The architecture is mostly there. There's no API yet, but perhaps
this project will provide the necessary nudge. More at the bottom of
this reply...
> and generate bitvectors for desired field(s)
> terms - storing a 1 in doc_num position for documents posessing the
> given term. For a field with 100 facets or terms - you'd need 100 bit
> vectors of at most max_docs bits long.
Exactly.
> An FS query would wrap a regular KS query - AND'ing the query results
> with each term's cached bitvector to derive a count of documents
> within the wrapped query that posess that term ( 100 ANDs + 100 counts
> of bitvectors no greater than maxdoc bits ).
Yes. Something like this:
my $intersection = KinoSearch::Util::BitVector->new;
while ( my ( $field_name, $field_set ) = each %cached_bit_sets ) ) {
$intersection->copy($result_set);
$intersection->and($field_set);
$facet_counts{$field_name} = $intersection->count;
}
> I have made a VERY naive implementation of this without glueing into
> KinoSearch XS/C, since I confess to _barely_ grokking charmony + XS +
> C beyond kindergarten level.
Let's just worry about rapid prototyping for now.
> My next goal would be to better understand KS internals so as to
> * use KS BitVectors to replace scalars and vec
KinoSearch::Util::BitVector has now been exposed as a public class as
of r3661. The API is similar to that of Java's BitSet.
> * make Facet::Counter into KSx::Search::FacetQuery to collect the >0
> scored results of a child query and count facets this way instead of
> using KS::Search::Hits
If all we wanted was the facet counts, we could start with application
logic something like this...
my $result_set = KinoSearch::Util::BitVector->new(
capacity => $searcher->max_docs + 1,
);
my $bit_collector = KinoSearch::Search::HitCollector::BitCollector-
>new(
bit_vector => $result_set,
);
$searcher->collect(
query => $query,
collector => $bit_collector,
);
... followed by the facet count code above.
HitCollector, BitCollector, and Searcher::collect have now been
exposed as public APIs as of r3662.
I might suggest one other bullet point:we want to be thinking about
how to scale this across a search cluster from the get-go. We can get
facet counts working pretty fast using the techniques above, but only
for a single machine. It becomes more challenging with multiple nodes
because Searchable's collect() method can't be implemented to work
efficiently over a remote connection.
Perhaps opening up InvIndexer and IndexReader is the key. (We can
assume that we're extending the index with new files, though that
wouldn't strictly be necessary as the existing index files contain all
the necessary information.) Here's a rough sketch of how things might
work at index-time:
my $facet_writer = KSx::Facets2::FacetWriter->new( invindex =>
$invindex );
my $seg_writer = KinoSearch::Index::SegWriter->new(
invindex => $invindex,
);
$seg_writer->add_writer($facet_writer);
my $invindexer = KinoSearch::InvIndexer->(
invindex => $invindex,
seg_writer => $seg_writer,
);
...
And, at search-time:
my $facet_reader = KSx::Facets2::FacetReader->new(
invindex => $invindex,
fields => \@facet_fields,
);
my $reader = KinoSearch::Index::IndexReader->open(
invindex => $invindex,
);
$reader->add_readers($facet_reader);
my $searcher = KinoSearch::Searcher->new(
reader => $reader,
);
...
while ( my $cgi = CGI::Fast->new ) {
my $query = $queryparser->parse( $cgi->param('q') || '' );
my $facet_query = KSx::Facets2::FacetQuery->new(
query => $query,
fields => \@facet_fields,
);
my $hits = $searcher->search( query => $facet_query );
...
}
The main idea is that the FacetQuery class would rely on an
IndexReader having a FacetReader as one of its children. The
FacetReader would be responsible for holding the cached BitVectors.
Marvin Humphrey
Rectangular Research
http://www.rectangular.com/
_______________________________________________
KinoSearch mailing list
KinoSearch at rectangular.com
http://www.rectangular.com/mailman/listinfo/kinosearch
More information about the kinosearch
mailing list