[KinoSearch] utf8 warnings/error

Marvin Humphrey marvin at rectangular.com
Sun Aug 19 11:59:37 PDT 2007




On Aug 19, 2007, at 11:28 AM, Scott Beck wrote:

> I'm indexing emails, mostly spam, and I'm running into a bunch of
> UTF-8 errors followed by an error from PolyAnalyzer.

All of KinoSearch's tools expect to be fed valid UTF-8.  It seems  
that they aren't getting it.

There is a line in SegWriter that should take any field value which  
does not have the SVf_UTF8 flag set and force it into UTF-8 before it  
gets sent through the analysis chain.

         if ( !$field_spec->binary ) {
             utf8ify( $doc->{$field_name} );
         }

However, if the SVf_UTF8 flag is already set, utf8ify() does nothing.

What I would like to know is whether the incoming field values are  
marked with the SVf_UTF8 flag, but are not truly valid UTF-8.   
Strings in that state are bad news.
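
To see which state a given value is in, something along these lines
may help.  It is only a sketch: describe_string() is a made-up name,
not part of KS.  It relies on the core utf8::is_utf8() and
utf8::valid() functions, which are available without loading any
module.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of KinoSearch): report whether the
# SVf_UTF8 flag is on, and whether the string's internal buffer is
# consistent with that flag.
sub describe_string {
    my ($value) = @_;
    my $flagged = utf8::is_utf8($value) ? 1 : 0;
    my $valid   = utf8::valid($value)   ? 1 : 0;
    return "flagged=$flagged valid=$valid";
}

print describe_string("plain ascii"), "\n";
print describe_string("\x{263A}"),    "\n";
```

A value that reports flagged=1 valid=0 is in the "bad news" state
described above.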

> Is there something I need to do to the strings I'm passing into  
> add_doc?

KS is ready for two possibilities.

   1) SVf_UTF8 is set, and the string is truly UTF-8.
   2) SVf_UTF8 is not set.  The string will be upgraded to
      UTF-8, assuming a source encoding of Latin 1.
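
If the raw email bodies are neither reliably UTF-8 nor Latin 1, one
way to land in case 1 is to decode them yourself before calling
add_doc.  The following is a rough sketch, assuming the input is raw
bytes with the flag off; to_characters() is a hypothetical name, and
the Latin 1 fallback is my assumption about what you would want for
undecodable input, not something KS requires.

```perl
use strict;
use warnings;
use Encode qw( decode );

# Hypothetical helper: turn raw bytes into a Perl character string.
# Try strict UTF-8 first; if the bytes are not well-formed UTF-8,
# fall back to Latin 1, where every byte maps to a character.
sub to_characters {
    my ($bytes) = @_;
    my $copy = $bytes;    # decode() with a CHECK value may modify its input
    my $chars = eval { decode( 'UTF-8', $copy, Encode::FB_CROAK ) };
    return $chars if defined $chars;
    return decode( 'ISO-8859-1', $bytes );
}

my $from_utf8   = to_characters("caf\xC3\xA9");    # well-formed UTF-8 bytes
my $from_latin1 = to_characters("caf\xE9");        # not UTF-8, falls back
print length($from_utf8), " ", length($from_latin1), "\n";
```

If the message headers declare a charset, passing that name to
decode() instead of guessing would be the better choice.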

Check your source strings via this line (see
<http://perldoc.perl.org/utf8.html>):

   utf8::valid($doc->{$field_name}) or die "Bad string!";

If the string passes muster with utf8::valid() but KS still has  
problems, then KS has a bug.  If it fails the check, then the bug is  
in your app, somewhere upstream of KS.
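
To find out which field is at fault, the same check can be run over
the whole document hash before it goes to add_doc.  This is a sketch
only: invalid_fields() is a hypothetical helper, and the sample $doc
is made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical pre-flight check: return the names of fields whose
# values fail utf8::valid(), so the offending field can be named
# in the error message.
sub invalid_fields {
    my ($doc) = @_;
    return grep { defined $doc->{$_} && !utf8::valid( $doc->{$_} ) }
        sort keys %$doc;
}

my $doc = { title => "hello", body => "smiley \x{263A}" };    # sample data
my @bad = invalid_fields($doc);
die "Bad string in field(s): @bad" if @bad;
```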

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/


