[KinoSearch] utf8 warnings/error
Marvin Humphrey
marvin at rectangular.com
Sun Aug 19 11:59:37 PDT 2007
On Aug 19, 2007, at 11:28 AM, Scott Beck wrote:
> I'm indexing emails, mostly spam, and I'm running into a bunch of
> UTF-8 errors followed by an error from PolyAnalyzer.
All of KinoSearch's tools expect to be fed valid UTF-8. It seems
that they aren't getting it.
There is a line in SegWriter that should take any field value which
does not have the SVf_UTF8 flag set and force it into UTF-8 before it
gets sent through the analysis chain.
    if ( !$field_spec->binary ) {
        utf8ify( $doc->{$field_name} );
    }
However, if the SVf_UTF8 flag is already set, utf8ify() does nothing.
What I would like to know is whether the incoming field values are
marked with the SVf_UTF8 flag, but are not truly valid UTF-8.
Strings in that state are bad news.
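To make the three states concrete, here is a minimal sketch using only core Perl plus Encode. The helper string_state() is hypothetical and not part of KinoSearch. Encode::_utf8_on() is an internal function, used here only to manufacture a string that is flagged as UTF-8 without actually being valid UTF-8.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper (not part of KinoSearch): report whether a scalar
# carries the SVf_UTF8 flag and whether its internal state is consistent.
sub string_state {
    my ($value) = @_;
    my $flagged = utf8::is_utf8($value) ? 1 : 0;
    my $valid   = utf8::valid($value)   ? 1 : 0;
    return "flagged=$flagged valid=$valid";
}

# Latin-1 bytes, flag off. This is a consistent state.
my $latin1 = "caf\xE9";

# Properly upgraded copy: flag on, buffer holds real UTF-8.
my $upgraded = $latin1;
utf8::upgrade($upgraded);

# Flag forced on without converting the bytes: the bad-news state.
my $broken = $latin1;
Encode::_utf8_on($broken);

print "latin1:   ", string_state($latin1),   "\n";
print "upgraded: ", string_state($upgraded), "\n";
print "broken:   ", string_state($broken),   "\n";
```

Only the last string should report flagged=1 valid=0, which is the state I suspect your field values are in.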
> Is there something I need to do to the strings I'm passing into
> add_doc?
KS is ready for two possibilities.
  1) SVf_UTF8 is set, and the string is truly UTF-8.
  2) SVf_UTF8 is not set. The string will be upgraded to
     UTF-8, assuming a source encoding of Latin 1.
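One consequence of case 2 is worth spelling out. If you hand over raw UTF-8 bytes without the flag set, they will be read as Latin 1 and each byte becomes its own character. The sketch below uses the core utf8::upgrade() to illustrate this. That utf8ify() behaves the same way is my assumption for the purposes of the example, not a quote from the KS source.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Raw UTF-8 bytes for "e with acute accent" (U+00E9), flag off.
my $raw = "\xC3\xA9";

# Upgrading treats each byte as one Latin-1 character.
# (Assumption: this mirrors what happens to unflagged field values.)
my $upgraded = $raw;
utf8::upgrade($upgraded);

# Decoding interprets the bytes as UTF-8 and yields one character.
my $decoded = decode( 'UTF-8', $raw );

print "characters after upgrade: ", length($upgraded), "\n";
print "characters after decode:  ", length($decoded),  "\n";
```

So if your source data is UTF-8, decode it yourself before calling add_doc rather than relying on the Latin 1 upgrade.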
Check your source strings via this line (see
<http://perldoc.perl.org/utf8.html>):
    utf8::valid($doc->{$field_name}) or die "Bad string!";
If the string passes muster with utf8::valid() but KS still has
problems, then KS has a bug. If it fails the check, then the bug is
in your app, somewhere before the string reaches KS.
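If the check fails, one way to clean things up on your side is to decode the raw message bytes explicitly before building the doc. This is only a sketch: to_valid_text() is a hypothetical helper, it assumes you are starting from raw octets (flag off), and the Latin 1 fallback is a guess that may not suit real-world spam in other encodings.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: turn raw octets into a character string that is
# safe to pass along. Tries strict UTF-8 first, then falls back to Latin 1.
sub to_valid_text {
    my ($octets) = @_;

    # decode() with a CHECK argument may modify its input, so use a copy.
    my $copy = $octets;
    my $text = eval { decode( 'UTF-8', $copy, Encode::FB_CROAK ) };
    return $text if defined $text;

    # Not well-formed UTF-8: assume Latin 1, where every byte is legal.
    return decode( 'ISO-8859-1', $octets );
}

my $from_utf8   = to_valid_text("caf\xC3\xA9");    # well-formed UTF-8 bytes
my $from_latin1 = to_valid_text("caf\xE9");        # not valid UTF-8

for my $text ( $from_utf8, $from_latin1 ) {
    utf8::valid($text) or die "Bad string!";
    print "characters: ", length($text), "\n";
}
```

Both inputs should come out as the same four-character string, and both pass the utf8::valid() check.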
Marvin Humphrey
Rectangular Research
http://www.rectangular.com/
_______________________________________________
KinoSearch mailing list
KinoSearch at rectangular.com
http://www.rectangular.com/mailman/listinfo/kinosearch