[KinoSearch] rfc: faceted search
Nathan Kurz
nate at verse.com
Fri Aug 1 12:32:59 PDT 2008
On Wed, Jul 30, 2008 at 1:38 PM, Marvin Humphrey <marvin at rectangular.com> wrote:
>
> On Jul 30, 2008, at 11:03 AM, Nathan Kurz wrote:
>
>> The bit vectors become
>> unworkable once you have a large number of facets (think of using
>> 'author' as a facet on a large dataset),
>
> If you anticipate needing a large number of facets with small data sets,
> that's where you don't use BitVectors any more.
Great, I think we agree here. The question might be what a 'large
number' means in this context. My guess is that the 'large number'
break point is somewhere between 10 and 100. My other guess (probably
skewed by my own interests) is that most of the interesting
applications are going to require 'large numbers' of facets, often
'very large numbers'. Since solutions that work with a the large
number of facets can (IMO) work adequately for the situations with
small numbers (but not vice versa), the large number solution strikes
me as primary.
> Instead you use a short
> array of sorted integers, possibly compressed -- I think Solr uses a class
> called SortedVIntList for this. If we added such a feature, we'd need to
> create methods to and() and or() BitVectors against these integer lists
> instead of against other BitVectors.
Rather than adding a separate infrastructure for doing Boolean
operations across BitVectors and SortedVIntList, I would strongly
suggest shoehorning these structures into the existing index
structures and use the existing scoring system. Just as as should be
possible for an index to use P4Delta compression, it should be
possible for an index to store information as a BitVector. This buys
you a lot more flexibility with a cleaner infrastructure. I'm also
suggesting that the current field specific inverted index would work
fine for many applications.
>> and the wrapped query
>> approach doesn't spread well to a cluster (as one would be forced to
>> round-trip the full list of matching docs over the net and not just
>> the top-n).
>
> You never need to communicate the full list of matching docs across the
> network. All you need to communicate is the top-n matching docs and the
> facet counts, which isn't a problem.
Great. This was my misunderstanding of the proposed solution.
> How to pass those facet counts, I'm not exactly sure.
There are lots of solutions to this, and it doesn't strike me as a
worry at all. It can be solved (and changed) after the internals are
sorted out.
> Caching result seets for each facet into BitVectors or what-have-you is
> important because matching a particular facet is likely to be an expensive
> query. Here are the top facet counts for a search at Amazon for "moon":
>
> * Books (343,415)
> * MP3 Downloads (35,160)
> * Home & Garden (22,234)
> * Music (7,086)
> ...
>
> We don't want to execute 'category:books', 'category:mp3_downloads', etc.
> with each search query. Therefore, we have to cache the result sets for
> those facets.
You are right that we don't want to go through the inverted index for
each of the possible facets, but this is why I was suggesting that we
could step through the forward indexed (doc->termlist) DocVectors to
count facets. While this could be counted as expensive, I think it
would be on a par with scoring a relatively common search term in the
inverted index, say, the search for 'moon' itself. I also think it
would be more efficient than merging 100 BitVectors, although less
efficient than merging 10.
> If I understand correctly, we'd get an even bigger benefit from index-time
> preparation of facet result sets when there are several search processes,
> since the data for the facet result sets would be cached in shared system
> memory pages, instead of in per-process BitVector objects.
Yes, a fine plan. There is definitely a benefit to having these
cached in files rather than in per-process memory. Accessing these
files via mmap gives some extra benefit, but since the file system
effectively uses mmap internally, you'll get most of this cross-proces
benefit using standard IO as well.
> I think there's some confusion as to the role of DocVector. Each DocVector
> is basically a self-contained single-document inverted index, and retrieving
> a DocVector is approximately as expensive as retrieving a Doc. Using
> DocVector while iterating through posting lists isn't feasible; retrieving
> one DocVector for each of the top hits after you've finished searching is.
Could it be made to be feasible? While I understand that random
access to the DocVectors will be expensive, I was hoping that
sequentially stepping through the TermVectors for a field could be
done very efficiently, especially for unanalyzed fields with a limited
vocabulary (ie, for a field used to store facets). Conceptually,
stepping over a doc->termList index doesn't strike me as inherently
any worse than stepping through the term->docList inverted index.
>> Caching, if it is to occur, happens in a
>> centralized way (memcached) at the top level for the entire query,
>> rather than component by component within the cluster.
>
> I can't quite wrap my head around that. Can you please elaborate?
I'm offering a philosophical conviction that it's best to do the
caching at the highest level possible rather than complicating the
lower levels. Rather than farming the request out to the cluster and
having each member check a local cache for facet values, I'm
suggesting that it would be better for the central dispatch to cache
the finished client response. I think this is easier to scale, makes
better use of system resources, and yields simpler and easier to
debug code.
>> ps. I can't tell at a glance how KinoSearch currently treats fields
>> in its indexes. For example, does each field get its own lexicon?
>
> Yes, right now that is the case. [...]
> Per-field lexicon files are not perfectly satisfactory, but explaining all
> the reasoning would require an extended digression.
Great, that's what I was hoping. I'm not sure what the downsides are,
but having per field lexicons means that limited vocabulary fields can
be stored much more efficiently than if they shared a lexicon with
everyone else.
Nathan Kurz
nate at verse.com
_______________________________________________
KinoSearch mailing list
KinoSearch at rectangular.com
http://www.rectangular.com/mailman/listinfo/kinosearch
More information about the kinosearch
mailing list