[KinoSearch] Highlighter and UTF-8 in 0.14
Eric LIAO Kehuang
ekliao at gmail.com
Thu Nov 16 14:15:40 PST 2006
Thanks Marvin. If I may voice an opinion on this: having read most of the
discussions on the forum regarding KS support for Unicode, I strongly agree
that the all-Unicode approach is the way to go. ISO-8859-1 users can easily
convert their data to UTF-8 before indexing and convert results back to
ISO-8859-1 using Encode when displaying them on their web sites. That takes
only a little effort on top of the out-of-the-box KS functionality. I feel
the advantage of all-Unicode far outweighs the effort of adding a
transcoding layer at both ends. Today we have multilingual text to search,
in different non-Unicode encodings. It's no longer just "Western European"
languages. Unicode is one good way to tie it all together.
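
To illustrate what "a transcoding layer at both ends" could look like, here is a minimal sketch in Python rather than Perl's Encode. The function names `to_index_form` and `to_display_form` are hypothetical, and nothing here is KinoSearch API; it only shows the round trip between ISO-8859-1 and UTF-8.

```python
# Hypothetical helpers: not part of KinoSearch or of Perl's Encode module.

def to_index_form(raw: bytes, source_encoding: str = "iso-8859-1") -> bytes:
    """Convert legacy-encoded bytes to UTF-8 before handing them to the indexer."""
    return raw.decode(source_encoding).encode("utf-8")


def to_display_form(utf8_bytes: bytes, target_encoding: str = "iso-8859-1") -> bytes:
    """Convert UTF-8 search results back to the site's legacy encoding.

    Characters the target encoding cannot represent are emitted as
    numeric character references so the page still renders them.
    """
    text = utf8_bytes.decode("utf-8")
    return text.encode(target_encoding, errors="xmlcharrefreplace")


if __name__ == "__main__":
    latin1 = "imm\u00e9diat".encode("iso-8859-1")
    indexed = to_index_form(latin1)
    print(indexed)                   # UTF-8 bytes for the indexer
    print(to_display_form(indexed))  # back to ISO-8859-1 for display
```

The `xmlcharrefreplace` fallback is one possible design choice for text that does not fit in ISO-8859-1; a site could just as well serve its pages as UTF-8 and skip the second conversion.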
Can you confirm that, aside from the Highlighter, indexing and searching of
UTF-8 data works correctly in 0.14? Or is the tokenization also broken? (For
example, would 0.14 break a French word like "immédiat" into "imm" and
"diat"?) I'd like to know whether I can still deploy KS for UTF-8 data and
come up with my own way of dealing with highlighting.
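
To make the question concrete, the split described above is what a word-character pattern produces when it is applied to raw UTF-8 bytes instead of decoded characters. The sketch below uses Python's `re` module purely as an illustration; it is not KinoSearch's tokenizer, and whether 0.14 actually behaves this way is exactly what is being asked.

```python
import re

word = "imm\u00e9diat"  # "immédiat", with é as the single code point U+00E9

# Byte semantics: é is stored as the two bytes 0xC3 0xA9 in UTF-8, and a
# bytes pattern treats \w as ASCII-only, so the word breaks at the accent.
byte_tokens = re.findall(rb"\w+", word.encode("utf-8"))

# Character semantics: on a decoded string, \w matches Unicode letters,
# so the word stays in one piece.
char_tokens = re.findall(r"\w+", word)

print(byte_tokens)
print(char_tokens)
```

With byte semantics the tokens come out as `imm` and `diat`; with character semantics the single token `immédiat` is preserved.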
I really look forward to 0.20 :)
Thanks!
Eric
On 11/16/06, Marvin Humphrey <marvin at rectangular.com> wrote:
>
>
> On Nov 15, 2006, at 6:03 PM, Eric LIAO Kehuang wrote:
>
> > In trying out indexing UTF-8 data with 0.12, I found that while
> > searching seems to work (recognizing accented French characters in
> > UTF-8), the Highlighter still messes up the accented characters due
> > to its byte:: semantics. Is this problem fixed with the recent
> > release of 0.14?
>
> Nope, still a problem. Fixing UTF-8 compatibility will break
> backwards compatibility because the only way to make it work is to go
> all-unicode throughout KS. So, the goal is to have the UTF-8 bugs
> sorted for version 0.20.
>
> Marvin Humphrey
> Rectangular Research
> http://www.rectangular.com/