[KinoSearch] utf8 warnings/error

Scott Beck scottbeck at gmail.com
Sun Aug 19 12:17:42 PDT 2007



Hi Marvin,


On 8/19/07, Marvin Humphrey <marvin at rectangular.com> wrote:
>
> On Aug 19, 2007, at 11:28 AM, Scott Beck wrote:
>
> > I'm indexing emails, mostly spam, and I'm running into a bunch of
> > UTF-8 errors followed by an error from PolyAnalyzer.
>
> All of KinoSearch's tools expect to be fed valid UTF-8.  It seems
> that they aren't getting it.
>
> KS is ready for two possibilities.
<snip>
>
>    1) SVf_UTF8 is set, and the string is truly UTF-8.
>    2) SVf_UTF8 is not set.  The string will be upgraded to
>       UTF-8, assuming a source encoding of Latin 1.
>
> Check your source strings via this line (see
> <http://perldoc.perl.org/utf8.html>):
>
>    utf8::valid($doc->{$field_name}) or die "Bad string!";
>
> If the string passes muster with utf8::valid() but KS still has
> problems, then KS has a bug.  If not, then there is a bug prior to KS
> in your app.
>

I tried this just before the add_doc() call in my code:

    for (keys %$email) {
        utf8::valid($email->{$_}) or die "Bad string!";
        warn "> $_ is valid utf8";
    }

I see the warning for every field, so none of the strings fail the
check. I also tried this, to make sure the strings are flagged as UTF-8:

    for (keys %$email) {
        unless (utf8::is_utf8($email->{$_})) {
            utf8::upgrade($email->{$_});
        }
        utf8::valid($email->{$_}) or die "Bad string!";
        warn "> $_ is valid utf8";
    }

I get the same errors and warnings with either of these inserted just
before the add_doc().
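
One caveat with both checks, as far as I can tell from the utf8 docs:
utf8::valid() also returns true for a string that is held as plain
bytes, and utf8::upgrade() treats those bytes as Latin 1, so neither
loop shows whether the raw email bytes were well-formed UTF-8 to begin
with. Below is a minimal sketch of a stricter test using the core
Encode module. The helper name octets_are_utf8 and the sample strings
are made up for illustration and are not part of KinoSearch.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper (not part of KinoSearch): returns 1 only if the
# raw octets in $bytes form well-formed UTF-8, 0 otherwise.
sub octets_are_utf8 {
    my ($bytes) = @_;
    my $copy = $bytes;    # decode() may modify its input when a CHECK is given
    return eval { Encode::decode('UTF-8', $copy, Encode::FB_CROAK); 1 } ? 1 : 0;
}

my $latin1 = "caf\xE9";        # Latin 1 octets, SVf_UTF8 off, not valid UTF-8
my $utf8   = "caf\xC3\xA9";    # the same word as UTF-8 octets, SVf_UTF8 off

# utf8::valid() only checks that the scalar is internally consistent.
print "valid(latin1): ", (utf8::valid($latin1) ? 1 : 0), "\n";
print "valid(utf8):   ", (utf8::valid($utf8)   ? 1 : 0), "\n";

# A strict decode is what tells the two byte strings apart.
print "octets_are_utf8(latin1): ", octets_are_utf8($latin1), "\n";
print "octets_are_utf8(utf8):   ", octets_are_utf8($utf8),   "\n";
```

Running something like this over the raw fields, before any upgrade,
would show which ones are not really UTF-8.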

Thanks,

Scott
