[KinoSearch] Unicode problem

Marvin Humphrey marvin at rectangular.com
Mon Mar 3 15:10:04 PST 2008


On Mar 3, 2008, at 8:43 AM, Father Chrysostomos wrote:

> I looked into it further and found that ‘ἐνδιαφέρον’
> came out encoded as UTF-8
> ("\341\274\220\316\275\316\264\316\271\316\261\317\206\341\275\263\317\201\316\277\316\275"),

If we isolate the original and use Devel::Peek to inspect it...

   use utf8;    # source is UTF-8, so the literal is decoded to characters
   use Devel::Peek;
   my $greek = 'ἐνδιαφέρον';
   Dump($greek);

... this is what we see:

SV = PV(0x91e374) at 0x8972dc
   REFCNT = 1
   FLAGS = (PADBUSY,PADMY,POK,pPOK,UTF8)
   PV = 0x11742e0 "\341\274\220\316\275\316\264\316\271\316\261\317\206\341\275\263\317\201\316\277\316\275"\0 [UTF8 "\x{1f10}\x{3bd}\x{3b4}\x{3b9}\x{3b1}\x{3c6}\x{1f73}\x{3c1}\x{3bf}\x{3bd}"]
   CUR = 22
   LEN = 24

Here's what's coming out of $lexicon->get_term:

SV = PV(0x91d0dc) at 0x912f80
   REFCNT = 1
   FLAGS = (PADBUSY,PADMY,POK,pPOK)
   PV = 0x117fce0 "\341\274\220\316\275\316\264\316\271\316\261\317\206\341\275\263\317\201\316\277\316\275"\0
   CUR = 22
   LEN = 24

The strings have the same byte sequence, but the second one is missing  
the UTF8 flag, so Perl is interpreting it as Latin1.
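
To make the difference concrete, here is a minimal sketch in plain Perl. It does not touch KinoSearch at all; it builds two scalars holding the same 22 octets, one with the UTF8 flag and one without, and the variable names are mine:

```perl
use strict;
use warnings;
use utf8;                       # the literal below is UTF-8 source text

my $chars = 'ἐνδιαφέρον';       # character string, UTF8 flag on
my $bytes = $chars;
utf8::encode($bytes);           # same octets in the buffer, UTF8 flag off

my $flag_chars    = utf8::is_utf8($chars) ? 1 : 0;
my $flag_bytes    = utf8::is_utf8($bytes) ? 1 : 0;
my $len_chars     = length $chars;   # counts characters
my $len_bytes     = length $bytes;   # counts octets, each read as one Latin1 character
my $strings_equal = $chars eq $bytes ? 1 : 0;

printf "flags: %d/%d  lengths: %d/%d  eq: %d\n",
    $flag_chars, $flag_bytes, $len_chars, $len_bytes, $strings_equal;
```

Although the buffers are byte-for-byte identical, Perl sees a 10-character string in one case and a 22-character string in the other, and `eq` says they differ.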

When we submit that scalar to $reader->doc_freq, the XS binding  
extracts the string using SvPVutf8, which causes the supposedly Latin1  
string to be, ahem, "upgraded" to UTF8: each of the 22 bytes is treated  
as a Latin1 character and encoded again.  The resulting garbage isn't  
in the index.
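
The double encoding can be imitated from Perl space. This is only a sketch: it uses utf8::encode as a stand-in for what SvPVutf8 hands to the C code when the flag is missing, and none of the variable names come from KinoSearch:

```perl
use strict;
use warnings;
use utf8;                          # the literal below is UTF-8 source text

my $term = 'ἐνδιاφέρον';
$term = 'ἐνδιαφέρον';              # the term as the user typed it

# The octets that were written to the index for this term.
my $indexed = $term;
utf8::encode($indexed);

# What get_term handed back before the fix: same octets, no UTF8 flag.
my $unflagged = $indexed;

# Stand-in for SvPVutf8 on an unflagged scalar: every octet is taken
# to be a Latin1 character and is encoded to UTF-8 a second time.
my $seen_by_xs = $unflagged;
utf8::encode($seen_by_xs);

my $indexed_len = length $indexed;
my $seen_len    = length $seen_by_xs;
my $same        = $indexed eq $seen_by_xs ? 1 : 0;

print "indexed: $indexed_len octets, looked up: $seen_len octets, same: $same\n";
```

Every octet in this particular term is above 0x7F, so each one turns into two octets and the lookup key no longer matches anything that was indexed.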

The problem was a missing SvUTF8_on in the XS binding for  
Lexicon_Get_Term.  Fixed by r3103.  Thanks for the report.
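
For anyone stuck on an older revision, something along these lines should work as a stopgap in Perl space. It is a sketch and not what r3103 contains: mark_as_utf8() is a hypothetical helper, and $raw merely stands in for a term returned without the flag:

```perl
use strict;
use warnings;
use utf8;                          # the literal below is UTF-8 source text

# mark_as_utf8() is a hypothetical helper, not part of KinoSearch.
sub mark_as_utf8 {
    my ($term) = @_;
    # utf8::decode checks that the octets are well-formed UTF-8 and, if so,
    # converts the scalar in place to a character string.
    utf8::decode($term) or die "term is not valid UTF-8";
    return $term;
}

my $expected = 'ἐνδιαφέρον';
my $raw      = $expected;
utf8::encode($raw);                # stand-in for an unflagged term

my $fixed   = mark_as_utf8($raw);
my $flag_on = utf8::is_utf8($fixed) ? 1 : 0;
my $matches = $fixed eq $expected ? 1 : 0;

print "flag: $flag_on  matches: $matches\n";
```

Unlike a bare SvUTF8_on, utf8::decode validates the octets first, which seems the safer choice when the string's origin is not under your control.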

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/



