[KinoSearch] Boolean searching across multiple fields
Marvin Humphrey
marvin at rectangular.com
Wed Oct 11 17:32:55 PDT 2006
On Oct 11, 2006, at 4:05 PM, Chris Nandor wrote:
>>> I know in the next version I can do, simply:
>>>
>>> my $query_parser = KinoSearch::QueryParser::QueryParser->new(
>>>     analyzer => $analyzer,
>>>     fields   => \@fields,
>>> );
>
> So this code will allow the above behavior, then?
Yes. QueryParser behaves like this because it's the most intuitive
behavior for the common case.
Most often, people want to search multiple fields -- say, title and
body. A required term such as "+senator" must match against AT LEAST
ONE field out of several. A prohibited term such as "-senator" MUST
NOT MATCH AGAINST ANY of them. It's as if all the fields were
flattened into one and QueryParser was generating a query against
that. However, the scoring algorithm still gets to use multiple
fields, which is important for returning the most relevant document set.
The guts that make that happen are kind of complicated (thank dog for
tests!) but the concept is straightforward:
QueryParser processes the input string one chunk at a time.
Consider the following input:
'+foo -bar "okee dokee"'
First chunk is '+foo'. It gets expanded to...
'+(title:foo OR body:foo)'
Next, '-bar' expands to...
'-(title:bar OR body:bar)'
Lastly, the phrase '"okee dokee"' gets treated as a single chunk,
expanding to...
'(title:"okee dokee" OR body:"okee dokee")'
(Note that the internal mechanism isn't literal text expansion --
QueryParser is using Query objects.)
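To make the chunk-by-chunk idea concrete, here is a minimal sketch in plain Perl. It performs the literal text expansion shown above, which is NOT how QueryParser works internally (it builds Query objects), and both expand_chunk() and split_chunks() are hypothetical helper names made up for this illustration, not part of KinoSearch.

```perl
use strict;
use warnings;

# Hypothetical helper: split a query string into chunks, keeping a quoted
# phrase (with its optional leading '+' or '-') together as one chunk.
sub split_chunks {
    my ($input) = @_;
    my @chunks = $input =~ /([+-]?"[^"]*"|\S+)/g;
    return @chunks;
}

# Hypothetical helper: expand one chunk across several fields, preserving
# a leading '+' (required) or '-' (prohibited) on the whole group.
sub expand_chunk {
    my ( $chunk, @fields ) = @_;
    my ( $prefix, $body ) = $chunk =~ /^([+-]?)(.*)$/s;
    my $alternatives = join ' OR ', map { $_ . ':' . $body } @fields;
    return $prefix . '(' . $alternatives . ')';
}

my @fields = qw( title body );
for my $chunk ( split_chunks('+foo -bar "okee dokee"') ) {
    print expand_chunk( $chunk, @fields ), "\n";
}
```

The point of the sketch is only that each chunk is expanded on its own, so the '+' or '-' ends up applying to the whole group of fields rather than to any single field.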
> Curiously, how would I do it in 0.12? Knowing that may help me
> understand
> the whole thing better.
That particular configuration is kind of hard to nail with 0.12. The
"negate operator bug" that was fixed in 0.13 also affected queries in
which all clauses are required, of which your '+foo +bar' is the
perfect reduced example.
QueryParser's clever trick is to handle the string chunk by chunk.
There's no public API for squeezing chunks out of QueryParser
one-at-a-time, though, so you can't duplicate the multi-field
functionality easily.
As a workaround, you can dump all content into one big field.
    $doc->set_value( title       => $title );
    $doc->set_value( body        => $body );
    $doc->set_value( all_content => "$title $body" );
Then, you create a QueryParser against the all_content field, and
your search for '+foo +bar' returns the correct set of documents.
    my $query_parser = KinoSearch::QueryParser::QueryParser->new(
        default_field => 'all_content',
    );
    my $query = $query_parser->parse('+foo +bar');
Essentially, you are flattening the fields yourself, rather than
letting the QueryParser from KinoSearch 0.13 do it for you.
This option gets recommended all the time on the Lucene user's list,
and it's OK for small document sets. However, the relevancy from
that search will be inferior to a search performed against multiple
fields, because the title text gets dumped into all_content rather
than staying separate -- where, as a short field, it will
automatically be weighted more heavily. With large document sets,
relevancy becomes a major concern, and I recommend against this
technique.
Another option is to rewrite your requirements. :) Make sure that
'foo' and 'bar' come to you already split up -- say from different
HTML form fields -- so you don't need to rely on QueryParser to break
up the string and determine what's required/prohibited. Then, you
can build up your own compound BooleanQuery piece by piece.
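
As a rough sketch of that last option, assuming the required and prohibited terms arrive already separated, the compound query can be assembled one clause per term. The code below uses plain Perl data as a stand-in for real query objects: build_clauses(), matches(), and the MUST / MUST_NOT labels are hypothetical names for this illustration only, not KinoSearch API, and the word-boundary regex is a crude substitute for a real analyzer. It does show the intended semantics: a required term has to hit at least one field, and a prohibited term must not hit any.

```perl
use strict;
use warnings;

# Hypothetical stand-in for a compound boolean query: one clause per term,
# each clause listing the same term against every field.
sub build_clauses {
    my (%args) = @_;
    my @fields = @{ $args{fields} };
    my @clauses;
    for my $spec ( [ MUST => 'required' ], [ MUST_NOT => 'prohibited' ] ) {
        my ( $occur, $key ) = @$spec;
        for my $term ( @{ $args{$key} || [] } ) {
            push @clauses, {
                occur  => $occur,
                any_of => [ map { +{ field => $_, term => $term } } @fields ],
            };
        }
    }
    return \@clauses;
}

# Evaluate the clauses against one document (a hash of field => text).
sub matches {
    my ( $doc, $clauses ) = @_;
    for my $clause (@$clauses) {
        my $hits = grep {
            my $text = $doc->{ $_->{field} };
            my $term = $_->{term};
            defined $text && $text =~ /\b\Q$term\E\b/i;
        } @{ $clause->{any_of} };
        return 0 if $clause->{occur} eq 'MUST'     && !$hits;
        return 0 if $clause->{occur} eq 'MUST_NOT' && $hits;
    }
    return 1;
}

my $clauses = build_clauses(
    fields     => [qw( title body )],
    required   => ['foo'],      # e.g. from one HTML form field
    prohibited => ['bar'],      # e.g. from another
);
my %doc = ( title => 'All about foo', body => 'Nothing else here' );
my $verdict = matches( \%doc, $clauses ) ? 'match' : 'no match';
print "$verdict\n";
```

With a real index you would build the equivalent structure out of query objects and hand it to the searcher instead of evaluating it by hand.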
Marvin Humphrey
Rectangular Research
http://www.rectangular.com/