[KinoSearch] Tokens with spaces inside

Roberto Henríquez roberto at freekeylabs.com
Wed Aug 22 03:38:12 PDT 2007


Hi all, new to the list.

I have just subscribed after searching the archives for something on
this, but could not find much. I would appreciate any pointers to relevant
threads or documentation, or failing that, any ideas that could help.
Sorry if I'm asking overly obvious questions.

My problem is this: I want my tokens to be able to contain spaces.
I'm building an index specifically for tags that can be made up
of several words, and I want searches to match exactly those words. So
I have built my PolyAnalyzer this way:

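    use KinoSearch::Analysis::LCNormalizer;
    use KinoSearch::Analysis::Tokenizer;
    use KinoSearch::Analysis::PolyAnalyzer;

    # Lowercase the text first, then split it into tokens.
    my $normalizer = KinoSearch::Analysis::LCNormalizer->new;
    my $token_re = qr/
        \w[\w ]*     # our tokens can have word characters AND spaces.
    /mxs;

    my $tokenizer = KinoSearch::Analysis::Tokenizer->new(
        token_re => $token_re
    );

    my $analyzer = KinoSearch::Analysis::PolyAnalyzer->new(
        analyzers => [ $normalizer, $tokenizer ],
    );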

I have built the analyzer without a stemmer as I want to search exactly 
for the words that compose the terms.

My first question: is there any way to examine what terms are
being indexed? That would help me confirm the tokenizer regex is correct
(I have tested it outside KinoSearch and it seems correct to me, but I
would like to know how KinoSearch actually uses it).
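
A check of that kind can be done in plain Perl without touching
KinoSearch at all. The sketch below only applies lowercasing and the same
regex to a sample string; `tokenize` is a made-up helper name for this
illustration, and nothing here shows what KinoSearch does internally.

```perl
use strict;
use warnings;

my $token_re = qr/
    \w[\w ]*     # a word character, then any run of word characters and spaces
/mxs;

# Hypothetical helper (not part of KinoSearch): lowercase the text,
# then collect every match of the token regex.
sub tokenize {
    my ($text) = @_;
    my @tokens = lc($text) =~ /($token_re)/g;
    return @tokens;
}

print join( '|', tokenize('Foo Bar,baz') ), "\n";    # prints "foo bar|baz"
```

Note that a tag followed by a space before the comma would keep that
trailing space in the token, so the input may need trimming first.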

Then with this analyzer I have built an index, but my searches return 
nothing. To build the queries I use a QueryParser that uses the analyzer 
above.

Using Data::Dumper I can examine the Query objects produced by the
parser, and it appears the parser has split the query string on the
spaces. I would have guessed the correct behavior is to divide the
query string into whatever my regex says are tokens (a word character
followed by any number of word characters and spaces).

I previously tried a similar regex, /\w[^,]*/msx, because my tokens
are comma-separated, but had no luck either.

So, my questions: am I mistaken in expecting the query terms to
contain spaces, just as I want my tokens to? Any other suggestions on how
to solve the "terms that contain spaces" problem?


I'm pasting below a sample dump of a Query object made by a QueryParser 
that uses the analyzer above.

Thanks in advance!

--R

The query is "foo bar,baz" (without the double quotes). I want the
search to be made of "foo bar" and "baz" as terms, but the parsed
query looks different:

Query: $VAR1 = bless( {
    'clauses' => [
        bless( {
            'query' => bless( {
                'boost' => 1,
                'term'  => bless( {
                    'text'  => 'foo',
                    'field' => 'tags'
                }, 'KinoSearch::Index::Term' )
            }, 'KinoSearch::Search::TermQuery' ),
            'occur' => 'SHOULD'
        }, 'KinoSearch::Search::BooleanClause' ),
        bless( {
            'query' => bless( {
                'boost'     => 1,
                'positions' => [ 0, 1 ],
                'terms'     => [
                    bless( {
                        'text'  => 'bar',
                        'field' => 'tags'
                    }, 'KinoSearch::Index::Term' ),
                    bless( {
                        'text'  => 'baz',
                        'field' => 'tags'
                    }, 'KinoSearch::Index::Term' )
                ],
                'slop'  => 0,
                'field' => 'tags'
            }, 'KinoSearch::Search::PhraseQuery' ),
            'occur' => 'SHOULD'
        }, 'KinoSearch::Search::BooleanClause' )
    ],
    'disable_coord'    => 0,
    'boost'            => 1,
    'max_clause_count' => 1024
}, 'KinoSearch::Search::BooleanQuery' );
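
The dump above (a TermQuery for 'foo', then a PhraseQuery for 'bar' and
'baz') is what one would get if the query string were split on whitespace
first and each piece were only then run through the analyzer. That is
purely a guess from the dump, not something confirmed against the
KinoSearch source. The plain-Perl sketch below imitates that assumed
order of operations; `parse_like_dump` is a made-up name and none of this
is KinoSearch API.

```perl
use strict;
use warnings;

my $token_re = qr/\w[\w ]*/;

# Assumption: the query string is cut on whitespace BEFORE the token
# regex is applied, so the regex never sees a space.
sub parse_like_dump {
    my ($query_string) = @_;
    my @clauses;
    for my $piece ( split /\s+/, lc($query_string) ) {
        my @tokens = $piece =~ /($token_re)/g;
        push @clauses, \@tokens if @tokens;    # skip empty pieces
    }
    return @clauses;
}

for my $clause ( parse_like_dump('foo bar,baz') ) {
    # One token looks like a term clause, several look like a phrase.
    my $label = @$clause > 1 ? 'phrase' : 'term';
    print "$label: ", join( ' + ', @$clause ), "\n";
}
# prints:
#   term: foo
#   phrase: bar + baz
```

If that guess is right, the token regex is working as written but is never
given input that contains a space.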



