[KinoSearch] utf8 warnings/error
Scott Beck
scottbeck at gmail.com
Sun Aug 19 12:17:42 PDT 2007
Hi Marvin,
On 8/19/07, Marvin Humphrey <marvin at rectangular.com> wrote:
>
> On Aug 19, 2007, at 11:28 AM, Scott Beck wrote:
>
> > I'm indexing emails, mostly spam, and I'm running into a bunch of
> > UTF-8 errors followed by an error from PolyAnalyzer.
>
> All of KinoSearch's tools expect to be fed valid UTF-8. It seems
> that they aren't getting it.
>
> KS is ready for two possibilities.
<snip>
>
> 1) SVf_UTF8 is set, and the string is truly UTF-8.
> 2) SVf_UTF8 is not set. The string will be upgraded to
> UTF-8, assuming a source encoding of Latin 1.
>
> Check your source strings via this line
> (see <http://perldoc.perl.org/utf8.html>):
>
> utf8::valid($doc->{$field_name}) or die "Bad string!";
>
> If the string passes muster with utf8::valid() but KS still has
> problems, then KS has a bug. If not, then there is a bug prior to KS
> in your app.
>
I tried this just before the add_doc in my code:

    for (keys %$email) {
        utf8::valid($email->{$_}) or die "Bad string!";
        warn "> $_ is valid utf8";
    }

I see the warn there for every field. Also I tried this just to make
sure the strings are UTF-8:

    for (keys %$email) {
        unless (utf8::is_utf8($email->{$_})) {
            utf8::upgrade($email->{$_});
        }
        utf8::valid($email->{$_}) or die "Bad string!";
        warn "> $_ is valid utf8";
    }

I get the same errors and warnings with either of these inserted just
before the add_doc().
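
A caveat on both checks above: according to the utf8 documentation,
utf8::valid() also returns true for any string that does not carry the
UTF-8 flag, and utf8::upgrade() treats each byte as a Latin-1 character.
So neither loop can tell raw UTF-8 bytes apart from Latin-1 bytes. The
sketch below shows the difference using the core Encode module. The
decode_strict() helper and the sample strings are hypothetical and only
for illustration. They are not part of KinoSearch, and this may or may
not be what triggers the errors reported here.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper (not part of KinoSearch): strictly decode raw bytes
# as UTF-8. Returns the character string, or undef if the bytes are malformed.
sub decode_strict {
    my ($bytes) = @_;
    # Work on a copy: decode() with a CHECK argument may modify its source.
    my $copy = $bytes;
    return eval { Encode::decode('UTF-8', $copy, Encode::FB_CROAK()) };
}

# "café" with the "é" stored as two raw UTF-8 bytes, UTF-8 flag off.
my $good = "caf\xC3\xA9";
# A lone 0xE9 byte is "é" in Latin-1 but is not well-formed UTF-8.
my $bad = "caf\xE9 au lait";

# utf8::valid() accepts both, because neither string carries the UTF-8 flag.
print "valid(good): ", (utf8::valid($good) ? 1 : 0), "\n";
print "valid(bad):  ", (utf8::valid($bad)  ? 1 : 0), "\n";

# utf8::upgrade() keeps the same 5 characters (each byte read as Latin-1).
my $upgraded = $good;
utf8::upgrade($upgraded);
print "length after upgrade: ", length($upgraded), "\n";

# Strict decoding turns the two bytes into one character (4 in total).
my $decoded = decode_strict($good);
print "length after decode:  ", length($decoded), "\n";

# The malformed input is rejected instead of being passed along.
print "bad decodes: ", (defined decode_strict($bad) ? "yes" : "no"), "\n";
```

If the email fields arrive as raw bytes in a known charset, decoding them
with that charset before add_doc() would be one way to satisfy the
"valid UTF-8" requirement quoted above.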
Thanks,
Scott