[KinoSearch] Match parts of a word
Marvin Humphrey
marvin at rectangular.com
Sat Dec 20 14:41:45 PST 2008
Jonas Kaufmann:
> we did simple SQL LIKE "%query%"-queries for our search, but this
> became too slow now. In the old version, a search for "ham" would
> match any occurrence of the word "ham" as well as "hamburg", "hamfoo"
> and "foohambar". In the KinoSearch version, only the word "ham" is
> matched.
>
> Do you know any method to get KinoSearch to work like our old search
> function? I think it might be possible using a special tokenizer for
> this...
This is not presently possible. You would need a custom Analyzer, but support
for custom Analyzers has been removed until the internal implementation
settles down.
When custom Analyzers are supported again, it will be possible but
expensive. You'll need to index every substring of every term, which will
result in a very large index: instead of saving one term for "hamster",
you'll save...
h ha ham hams hamst hamste hamster
a am ams amst amste amster
m ms mst mste mster
s st ste ster
t te ter
e er
r
That looks obscenely inefficient, and it probably will be -- but as a
brute-force solution it should work for small document collections.
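To make the cost concrete, here is a rough sketch of the brute-force idea in
Python. It is not KinoSearch code and does not use any KinoSearch API; the
names substrings, build_index and search are invented for this illustration,
and the "index" is just an in-memory dict mapping each substring to the set
of documents that contain it.

```python
def substrings(term):
    """Return every contiguous substring of term, ordered by start offset."""
    n = len(term)
    return [term[i:j] for i in range(n) for j in range(i + 1, n + 1)]


def build_index(docs):
    """Map every substring of every whitespace-separated word to doc ids.

    docs is a dict of {doc_id: text}. This is a toy inverted index,
    assumed here only to illustrate the approach.
    """
    index = {}
    for doc_id, text in docs.items():
        for word in text.lower().split():
            for sub in substrings(word):
                index.setdefault(sub, set()).add(doc_id)
    return index


def search(index, query):
    """Look up the query as a single term, like SQL LIKE '%query%' per word."""
    return index.get(query.lower(), set())


docs = {1: "ham sandwich", 2: "Hamburg harbour", 3: "foohambar", 4: "cheese"}
index = build_index(docs)
print(sorted(search(index, "ham")))  # [1, 2, 3]
```

A term of n characters produces n(n+1)/2 substrings, so "hamster" alone
contributes 28 entries, which is where the index bloat comes from.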
Once KinoSearch has an official C API and a pluggable indexing architecture,
then it will be possible for some enterprising individual to write a proper
KSx component to support SQL LIKE %query%-queries. I doubt that the inverted
index data structures at the heart of the KS core are the best possible tools
for that task.
Marvin Humphrey