[KinoSearch] Boolean searching across multiple fields

Marvin Humphrey marvin at rectangular.com
Wed Oct 11 17:32:55 PDT 2006


On Oct 11, 2006, at 4:05 PM, Chris Nandor wrote:

>>> I know in the next version I can do, simply:
>>>
>>>  	my $query_parser = KinoSearch::QueryParser::QueryParser->new(
>>>  		analyzer	=> $analyzer,
>>>  		fields		=> \@fields,
>>>  	);
>
> So this code will allow the above behavior, then?

Yes.  QueryParser behaves like this because it's the most intuitive  
behavior for the common case.

Most often, people want to search multiple fields -- say, title and  
body.  A required term such as "+senator" must match against AT LEAST  
ONE field out of several.  A prohibited term such as "-senator" MUST  
NOT MATCH AGAINST ANY of them.  It's as if all the fields were  
flattened into one and QueryParser was generating a query against  
that.  However, the scoring algorithm still gets to use multiple  
fields, which is important for returning the most relevant document set.
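
To make the matching rule concrete, here is a minimal pure-Perl sketch. It does not touch KinoSearch; the sample documents and the matches_any_field helper are invented for illustration, and it ignores analysis and scoring entirely. It only shows which documents pass a required or a prohibited clause when there are several fields.

```perl
use strict;
use warnings;

# Hypothetical in-memory documents; each field is a plain string.
my @docs = (
    { title => 'Senator speaks', body => 'budget vote today' },
    { title => 'Budget vote',    body => 'the senator abstained' },
    { title => 'Weather report', body => 'rain expected' },
);
my @fields = qw( title body );

# Count the fields in which $term appears as a whole word.
# A non-zero count means "matches at least one field".
sub matches_any_field {
    my ( $doc, $term ) = @_;
    return scalar grep { $doc->{$_} =~ /\b\Q$term\E\b/i } @fields;
}

# Required clause: the term must match at least one field.
my @required = grep { matches_any_field( $_, 'senator' ) } @docs;

# Prohibited clause: the term must not match any field.
my @prohibited = grep { !matches_any_field( $_, 'senator' ) } @docs;

print "+senator keeps: ", join( ', ', map { $_->{title} } @required ),   "\n";
print "-senator keeps: ", join( ', ', map { $_->{title} } @prohibited ), "\n";
```

The first two documents pass the required clause even though "senator" shows up in a different field in each, and only the third survives the prohibited clause.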

The guts that make that happen are kind of complicated (thank dog for  
tests!) but the concept is straightforward:
QueryParser processes the input string one chunk at a time.

Consider the following input:

     '+foo -bar "okee dokee"'

First chunk is '+foo'.  It gets expanded to...

     '+(title:foo OR body:foo)'

Next, '-bar' expands to...

     '-(title:bar OR body:bar)'

Lastly, the phrase '"okee dokee"' gets treated as a single chunk,  
expanding to...

     '(title:"okee dokee" OR body:"okee dokee")'

(Note that the internal mechanism isn't literal text expansion --  
QueryParser is using Query objects.)
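
The chunk-by-chunk idea can be sketched as literal text expansion, purely for illustration. This is not how QueryParser is implemented (it builds Query objects, as noted above); split_chunks and expand_chunk are hypothetical helpers, and the chunking regex is a simplification that only understands +/- prefixes and double-quoted phrases.

```perl
use strict;
use warnings;

# Split a query string into chunks: an optional +/- prefix followed by
# either a double-quoted phrase or a run of non-whitespace characters.
sub split_chunks {
    my ($query_string) = @_;
    return $query_string =~ /([+-]?(?:"[^"]*"|\S+))/g;
}

# Expand one chunk across every field, keeping its +/- prefix outside
# the parenthesized group.
sub expand_chunk {
    my ( $chunk, @fields ) = @_;
    my ( $prefix, $text ) = $chunk =~ /^([+-]?)(.*)$/s;
    return $prefix . '(' . join( ' OR ', map { $_ . ':' . $text } @fields ) . ')';
}

my @fields   = qw( title body );
my @expanded = map { expand_chunk( $_, @fields ) }
    split_chunks('+foo -bar "okee dokee"');

print "$_\n" for @expanded;
```

Run against the example input, this prints the same three expansions shown above, one per line.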

> Curiously, how would I do it in 0.12?  Knowing that may help me  
> understand
> the whole thing better.

That particular configuration is actually kind of hard to nail with
0.12.  The "negate operator bug" that was fixed in 0.13 also affected
queries in which all clauses are required, of which your '+foo +bar'
is the perfect reduced example.

QueryParser's clever trick is to handle the string chunk by chunk.
There's no public API for squeezing chunks out of QueryParser
one-at-a-time, though, so you can't duplicate the multi-field
functionality easily.

As a workaround, you can dump all content into one big field.

    $doc->set_value( title       => $title );
    $doc->set_value( body        => $body );
    $doc->set_value( all_content => "$title $body" );

Then, you create a QueryParser against the all_content field, and  
your search for '+foo +bar' returns the correct set of documents.

     my $query_parser = KinoSearch::QueryParser::QueryParser->new(
         analyzer      => $analyzer,    # same analyzer used at index time
         default_field => 'all_content',
     );
     my $query = $query_parser->parse('+foo +bar');

Essentially, you are flattening the fields yourself, rather than  
letting the QueryParser from KinoSearch 0.13 do it for you.

This option gets recommended all the time on the Lucene user list,
and it's OK for small document sets.  However, the relevancy from
that search will be inferior to a search performed against multiple
fields, because the title text gets dumped into all_content rather
than staying separate -- where, as a short field, it will
automatically be weighted more heavily.  With large document sets,
relevancy becomes a major concern, and I recommend against this
technique.
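
A rough worked example of why the short field counts for more. It assumes a Lucene-style length normalization factor of 1 / sqrt(number of tokens in the field); the exact formula in any given KinoSearch release may differ, and the token counts below are made up.

```perl
use strict;
use warnings;

# Assumed Lucene-style length normalization: 1 / sqrt(number of tokens).
# This is an illustration, not KinoSearch's actual Similarity code.
sub length_norm {
    my ($num_tokens) = @_;
    return 1 / sqrt($num_tokens);
}

my $title_tokens = 5;      # hypothetical title length
my $body_tokens  = 495;    # hypothetical body length

printf "title alone: %.3f\n", length_norm($title_tokens);
printf "body alone:  %.3f\n", length_norm($body_tokens);
printf "all_content: %.3f\n", length_norm( $title_tokens + $body_tokens );
```

Under that assumption, a hit in the five-token title gets roughly ten times the length factor of a hit in the body. Once the title is folded into all_content, the same hit is normalized against the combined length and that advantage disappears.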

Another option is to rewrite your requirements.  :)  Make sure that  
'foo' and 'bar' come to you already split up -- say from different  
HTML form fields -- so you don't need to rely on QueryParser to break  
up the string and determine what's required/prohibited.  Then, you  
can build up your own compound BooleanQuery piece by piece.

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/




