[KinoSearch] utf8 warnings/error

Scott Beck scottbeck at gmail.com
Sun Aug 19 11:28:48 PDT 2007



Hi,

I'm indexing emails, mostly spam, and I'm running into a bunch of
UTF-8 warnings followed by an error from PolyAnalyzer. Here are a few of
the warnings:

Malformed UTF-8 character (unexpected non-continuation byte 0xcf,
immediately after start byte 0xfb) in subroutine entry at
/usr/lib/perl5/vendor_perl/5.8.4/i686-linux/KinoSearch/Analysis/PolyAnalyzer.pm
line 77.
Malformed UTF-8 character (unexpected non-continuation byte 0xcf,
immediately after start byte 0xfb) in subroutine entry at
/usr/lib/perl5/vendor_perl/5.8.4/i686-linux/KinoSearch/Analysis/PolyAnalyzer.pm
line 77.
Malformed UTF-8 character (unexpected non-continuation byte 0xea,
immediately after start byte 0xcd) in subroutine entry at
/usr/lib/perl5/vendor_perl/5.8.4/i686-linux/KinoSearch/Analysis/PolyAnalyzer.pm
line 77.

Let me know if you want to see all of the warnings. The error that
follows them looks like this:

[error] Caught exception in
GMail::Controller::User::Mail::Folder->begin "Error in function
XS_KinoSearch__Analysis__Tokenizer__do_analyze at
lib/KinoSearch.xs:4758: scanned past end of '
   ÄãºÃ:±¾¹«Ë¾ÏÖÏòÒµÅóÓÑÌá¹(c)Ò»ÏîÓÅ»Ý(´ú¿ª´úÊÛ·¢Æ±)¿É´úÀíÈ«¹ú¸÷µØ·¢Æ±,ÎÒ˾±ÈÆäËü¹«Ë¾ÓÅ»Ý%20,

»õµ½¸¶¿î,ÈçÓÐÐèÒªÇëÀ´µçÏê̸:

        ÁªÏµÈË:Öܽ¨·¢           ÁªÏµÊÖ»ú:(0)13543676298

'
         at /usr/lib/perl5/vendor_perl/5.8.4/i686-linux/KinoSearch/Analysis/PolyAnalyzer.pm
line 77
        KinoSearch::Analysis::PolyAnalyzer::analyze_field('KinoSearch::Analysis::PolyAnalyzer=HASH(0x8fa0adc)',
'HASH(0x897e0d4)', 'body') called at
/usr/lib/perl5/vendor_perl/5.8.4/i686-linux/KinoSearch/Index/SegWriter.pm
line 104
        KinoSearch::Index::SegWriter::add_doc('KinoSearch::Index::SegWriter=HASH(0x8983774)',
'HASH(0x897e0d4)', 1) called at
/usr/lib/perl5/vendor_perl/5.8.4/i686-linux/KinoSearch/InvIndexer.pm
line 114
        KinoSearch::InvIndexer::add_doc('KinoSearch::InvIndexer=HASH(0x89849a4)',
'HASH(0x897e0d4)') called at
/usr/lib/gmail_maildir/GT/Maildir/KinoSearch/Indexer.pm line 200
        GT::Maildir::KinoSearch::Indexer::index('GT::Maildir::KinoSearch::Indexer=HASH(0x8b0e180)',
'/var/home/alex/alex.krohn.org/mail/alex/Maildir/./cur/1182973...')
called at GMail::Model::Maildir::Folder::index line 45
...
The rest of the stack trace is in my code.

Is there something I need to do to the strings I'm passing into add_doc?

Thanks,

Scott
