[KinoSearch] Unicode problem
Father Chrysostomos
sprout at cpan.org
Mon Mar 3 08:43:29 PST 2008
There seems to be a problem with KinoSearch’s Unicode support. Greek
words can be listed in the index, but they always have a doc_freq of
0. The attached script demonstrates this problem. This is the output
it gives me:
Greek occurs in 1 document.
Hmm occurs in 1 document.
as occurs in 1 document.
in occurs in 1 document.
interesting occurs in 1 document.
or occurs in 1 document.
say occurs in 1 document.
they occurs in 1 document.
ἐνδιαφέρον occurs in 0 documents.
It didn’t give me any wide char warnings, so I looked into it further
and found that ‘ἐνδιαφέρον’ came out encoded as UTF-8
("\341\274\220\316\275\316\264\316\271\316\261\317\206\341\275\263\317\201\316\277\316\275"), so
maybe that’s part of the problem.
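
For anyone who wants to double-check the escaped bytes above, here is a minimal sketch. It is written in Python purely for illustration and does not touch KinoSearch at all; the variable names are my own. It only confirms that the quoted octal sequence is valid UTF-8 for a ten-character word, and shows that the same bytes read one byte per character give a different string. Whether a mismatch of that kind is what leads to the doc_freq of 0 is only a guess.

```python
# Bytes exactly as quoted in the message, using the same octal escapes.
raw = (b"\341\274\220\316\275\316\264\316\271\316\261"
       b"\317\206\341\275\263\317\201\316\277\316\275")

# Decoding as UTF-8 recovers the character string.
word = raw.decode("utf-8")
print(word)
print(len(raw), "bytes;", len(word), "characters")  # 22 bytes; 10 characters

# Assumption for illustration only: if one side of a lookup treated the
# bytes as one character each (Latin-1 style), the two strings would differ.
latin1_view = raw.decode("latin-1")
print(word == latin1_view)  # False
```

If the index stores one form of the term and the lookup uses the other, a comparison like the last line would fail, which would be consistent with the output shown earlier.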
-------------- next part --------------
A non-text attachment was scrubbed...
Name: unitest
Type: application/octet-stream
Size: 1021 bytes
Desc: not available
Url : http://www.rectangular.com/pipermail/kinosearch/attachments/20080303/dc4c0870/unitest.obj