[KinoSearch] Unicode problem
Marvin Humphrey
marvin at rectangular.com
Mon Mar 3 15:10:04 PST 2008
On Mar 3, 2008, at 8:43 AM, Father Chrysostomos wrote:
> I looked into it further and found that ‘ἐνδιαφέρον’
> came out encoded as UTF-8
> ("\341\274\220\316\275\316\264\316\271\316\261\317\206\341\275\263\317\201\316\277\316\275"),
If we isolate the original and use Devel::Peek to inspect it...
use utf8;    # the source file is UTF-8; without this the literal is plain bytes
use Devel::Peek;
my $greek = 'ἐνδιαφέρον';
Dump($greek);
... this is what we see:
SV = PV(0x91e374) at 0x8972dc
  REFCNT = 1
  FLAGS = (PADBUSY,PADMY,POK,pPOK,UTF8)
  PV = 0x11742e0 "\341\274\220\316\275\316\264\316\271\316\261\317\206\341\275\263\317\201\316\277\316\275"\0 [UTF8 "\x{1f10}\x{3bd}\x{3b4}\x{3b9}\x{3b1}\x{3c6}\x{1f73}\x{3c1}\x{3bf}\x{3bd}"]
  CUR = 22
  LEN = 24
Here's what's coming out of $lexicon->get_term:
SV = PV(0x91d0dc) at 0x912f80
  REFCNT = 1
  FLAGS = (PADBUSY,PADMY,POK,pPOK)
  PV = 0x117fce0 "\341\274\220\316\275\316\264\316\271\316\261\317\206\341\275\263\317\201\316\277\316\275"\0
  CUR = 22
  LEN = 24
The strings have the same byte sequence, but the second one is missing
the UTF8 flag, so Perl is interpreting it as Latin1.
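The difference is also visible from plain Perl. A minimal sketch, assuming only the core utf8:: functions; the variable names are illustrative, and the string is built from the code points shown in the first dump:

```perl
use strict;
use warnings;

# Code points taken from the Devel::Peek dump. A literal containing
# code points above 0xFF is stored with the UTF8 flag on.
my $flagged = "\x{1f10}\x{3bd}\x{3b4}\x{3b9}\x{3b1}\x{3c6}\x{1f73}\x{3c1}\x{3bf}\x{3bd}";

# Same 22 bytes in the buffer, but with the UTF8 flag off, which is
# the state of the scalar coming out of get_term.
my $unflagged = $flagged;
utf8::encode($unflagged);

printf "flagged:   utf8=%d length=%d\n", (utf8::is_utf8($flagged)   ? 1 : 0), length($flagged);
printf "unflagged: utf8=%d length=%d\n", (utf8::is_utf8($unflagged) ? 1 : 0), length($unflagged);
print "eq: ", ($flagged eq $unflagged ? "same" : "different"), "\n";
```

With the flag on, Perl sees 10 characters; with it off, it sees 22 Latin1 characters, so the two scalars do not compare equal.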
When we submit that scalar to $reader->doc_freq, the XS binding
extracts the string using SvPVutf8, which causes the supposedly Latin1
string to be, ahem, "upgraded" to UTF8. The resulting garbage isn't
in the index.
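The double encoding can be reproduced without KinoSearch. This sketch uses utf8::upgrade as a stand-in for what SvPVutf8 does to an unflagged scalar; the variable names are made up for illustration:

```perl
use strict;
use warnings;

# The 22 bytes from the dump, UTF8 flag off (octal escapes as shown there).
my $term = "\341\274\220\316\275\316\264\316\271\316\261\317\206\341\275\263\317\201\316\277\316\275";

# Treat each byte as a Latin1 character and re-encode it as UTF-8
# internally, as happens when an unflagged scalar is upgraded.
my $upgraded = $term;
utf8::upgrade($upgraded);

# Expose the internal buffer as plain bytes for inspection.
my $buffer = $upgraded;
utf8::encode($buffer);

printf "before: %d bytes\n", length($term);
printf "after:  %d bytes\n", length($buffer);
printf "starts: %s\n", unpack('H*', substr($buffer, 0, 6));
```

Every byte of the original is 0x80 or above, so each one becomes two bytes and the buffer doubles in size. The leading \341\274\220 turns into c3 a1 c2 bc c2 90, which matches nothing stored in the index.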
The problem was a missing SvUTF8_on in the XS binding for
Lexicon_Get_Term. Fixed by r3103. Thanks for the report.
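The fix itself lives in the XS layer, but a caller working with a revision that lacks it could apply a similar repair from Perl. A sketch, assuming the returned scalar holds valid UTF-8 bytes; unlike SvUTF8_on, which only sets the flag, utf8::decode validates the bytes first:

```perl
use strict;
use warnings;

# The term as it should look: 10 characters, UTF8 flag on.
my $expected = "\x{1f10}\x{3bd}\x{3b4}\x{3b9}\x{3b1}\x{3c6}\x{1f73}\x{3c1}\x{3bf}\x{3bd}";

# The term as returned without the flag: 22 raw bytes.
my $term = "\341\274\220\316\275\316\264\316\271\316\261\317\206\341\275\263\317\201\316\277\316\275";

# Decode in place; returns false if the bytes are not well-formed UTF-8.
utf8::decode($term) or die "term is not valid UTF-8";

print(($term eq $expected) ? "match\n" : "mismatch\n");
```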
Marvin Humphrey
Rectangular Research
http://www.rectangular.com/