[KinoSearch] Invalid UTF-8
Peter Karman
peter at peknet.com
Mon Jan 25 08:07:52 PST 2010
I'm now seeing this error against the latest svn trunk:
Invalid UTF-8 sequence in
'/opt/pij/search/sources.index.ks/seg_1/lextemp-12464-to-1353267' at
byte 12466, kino_TextTermStepper_read_delta at
../core/KinoSearch/FieldType/TextType.c line 145
The frustrating thing is that I just spent 2 weeks making sure my files
are all valid UTF-8 (same old story -- a legacy db with a mix of latin1,
cp1252, and UTF-8, sometimes all in the same string!), and they all pass
my Search::Tools::UTF8 checks.
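For an independent cross-check of the source bytes, something along these
lines could be used. It is a minimal Python sketch, not part of KinoSearch
or Search::Tools; the function name and the sample strings are made up for
illustration. It relies only on Python's strict UTF-8 decoder and reports
the byte offset of the first invalid sequence, if any.

```python
from typing import Optional


def first_invalid_utf8_offset(data: bytes) -> Optional[int]:
    """Return the byte offset of the first invalid UTF-8 sequence, or None.

    Hypothetical helper for illustration; it uses Python's strict decoder.
    """
    try:
        data.decode("utf-8", errors="strict")
    except UnicodeDecodeError as err:
        # err.start is the index of the first byte that failed to decode.
        return err.start
    return None


# Made-up samples mimicking a legacy db with mixed encodings.
samples = {
    "utf8": "caf\u00e9".encode("utf-8"),       # valid UTF-8
    "latin1": "caf\u00e9".encode("latin-1"),   # bare 0xE9 byte
    "cp1252": "it\u2019s".encode("cp1252"),    # bare 0x92 byte
}

for name, raw in samples.items():
    print(name, first_invalid_utf8_offset(raw))
```

A check like this only says whether the input bytes are well-formed UTF-8.
It says nothing about what the indexer writes to its own files.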
What's odd is that the 'Invalid UTF-8 sequence' error is thrown during
commit() rather than when I call add_doc(), which makes me think that
perhaps this isn't necessarily an encoding problem with my docs. I see
that all text strings are forced to UTF-8 in add_doc() via invert_doc()
and the SvPVutf8 call, so presumably they should all be valid UTF-8 by
the time they reach commit()?
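As a purely illustrative sketch of how the two stages could differ (Python,
not KinoSearch code; `is_valid_utf8` and the sample term are hypothetical,
and this is not a claim about what actually happens in the lexicon files):
bytes that were valid when encoded can still form an invalid sequence if
they are later read back cut in the middle of a multi-byte character.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes as strict UTF-8 (hypothetical helper)."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# Encoding a text string always yields well-formed UTF-8 bytes.
term = "na\u00efve".encode("utf-8")   # b'na\xc3\xafve'

# Cutting inside the two-byte character leaves a dangling lead byte.
truncated = term[:3]                  # b'na\xc3'

print(is_valid_utf8(term), is_valid_utf8(truncated))
```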
--
Peter Karman . http://peknet.com/ . peter at peknet.com