[KinoSearch] Match parts of a word

Marvin Humphrey marvin at rectangular.com
Sat Dec 20 14:41:45 PST 2008


Jonas Kaufmann:

> we did simple SQL LIKE "%query%"-queries for our search, but this 
> became too slow now. In the old version, a search for "ham" would
> match any occurrence of the word "ham" as well as "hamburg", "hamfoo"
> and "foohambar". In the KinoSearch version, only the word "ham" is
> matched.
> 
> Do you know any method to get KinoSearch to work like our old search
> function? I think it might be possible using a special tokenizer for
> this...

This is not presently possible.  You would need a custom Analyzer, but support
for custom Analyzers has been removed until the internal implementation
settles down.  

When custom Analyzers are supported again, it will be possible but
expensive.  You'll need to index every substring of every term, which will
result in a very large index: instead of saving one term for "hamster",
you'll save...

  h ha ham hams hamst hamste hamster
     a  am  ams  amst  amste  amster
         m   ms   mst   mste   mster
              s    st    ste    ster
                    t     te     ter
                           e      er
                                   r

That looks obscenely inefficient, and it probably will be -- but as a
brute-force solution it should work for small document collections.
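
To put a number on the blowup: a term of length n has n(n+1)/2 contiguous
substrings, so "hamster" becomes 28 terms.  Here is a rough sketch of the
brute-force idea in plain Python.  It does not use KinoSearch or any Analyzer
API; the names substrings and build_index, and the sample documents, are made
up purely for illustration, and whitespace splitting stands in for a real
tokenizer.

```python
from collections import defaultdict


def substrings(term):
    """Return every contiguous substring of term, ordered by start offset."""
    n = len(term)
    return [term[i:j] for i in range(n) for j in range(i + 1, n + 1)]


def build_index(docs):
    """Build a toy inverted index mapping each substring to a set of doc ids.

    docs is a dict of {doc_id: text}.  Text is lowercased and split on
    whitespace, which is only a stand-in for real tokenization.
    """
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            for sub in substrings(word):
                index[sub].add(doc_id)
    return index


# Hypothetical sample documents.
docs = {
    1: "ham",
    2: "Hamburg harbor",
    3: "foohambar",
    4: "hamster wheel",
    5: "cheese",
}
index = build_index(docs)

print(len(substrings("hamster")))  # 28
print(sorted(index["ham"]))        # [1, 2, 3, 4]
```

A lookup for "ham" is then a single exact-term match, which is what gives
LIKE "%ham%" behavior, but every document pays for it at index time with a
quadratic number of terms per word.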

Once KinoSearch has an official C API and a pluggable indexing architecture,
then it will be possible for some enterprising individual to write a proper
KSx component to support SQL LIKE %query%-queries.  I doubt that the inverted
index data structures at the heart of the KS core are the best possible tools
for that task.

Marvin Humphrey