[KinoSearch] KinoSearch & Similar/Duplicate Documents

Vladimir Vlach vladaman at gmail.com
Mon Feb 25 03:00:43 PST 2008


Hello!

I love using KinoSearch. So far it's doing everything we need for our
project. I wonder if you could suggest a way to retrieve similar
documents and duplicates. We index a few websites, and sometimes the
same document is posted under different URLs. How can we solve this?
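
To make the question concrete, here is a rough sketch of what is meant by
"duplicates" and "similar". It is plain Python and runs before indexing; it
does not use any KinoSearch API. It assumes exact duplicates can be caught by
hashing whitespace-normalized text, and that similarity can be approximated
by Jaccard overlap of word shingles. The function names and example.com URLs
are made up for illustration.

```python
import hashlib
import re
from collections import defaultdict


def fingerprint(text):
    """SHA-1 of the text after lowercasing and collapsing whitespace."""
    normalized = re.sub(r"\s+", " ", text.lower()).strip()
    return hashlib.sha1(normalized.encode("utf-8")).hexdigest()


def group_duplicates(pages):
    """Map each fingerprint shared by two or more URLs to those URLs (sorted)."""
    groups = defaultdict(list)
    for url, text in pages.items():
        groups[fingerprint(text)].append(url)
    return {fp: sorted(urls) for fp, urls in groups.items() if len(urls) > 1}


def shingles(text, k=3):
    """Set of k-word windows; a text shorter than k words yields one shingle."""
    words = text.lower().split()
    return {tuple(words[i:i + k]) for i in range(max(len(words) - k + 1, 1))}


def jaccard(a, b, k=3):
    """Shingle overlap in [0, 1]; 1.0 means identical shingle sets."""
    sa, sb = shingles(a, k), shingles(b, k)
    return len(sa & sb) / len(sa | sb)


# Hypothetical sample data: the same article reachable under two URLs.
pages = {
    "http://example.com/a": "Hello  World",
    "http://example.com/b": "hello world",
    "http://example.com/c": "something else entirely",
}
duplicate_groups = group_duplicates(pages)
```

Comparing every pair of pages with jaccard() is quadratic in the number of
pages, so for a larger crawl something cheaper would probably be needed.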

Another issue we have is not related to KinoSearch. We would like to
remove the parts of a page that are repeated across pages (for example,
a navigation menu shared by all pages). Removing the content is quite
easy, but how would you detect which parts are repeated across pages?
A diff algorithm? What kind of approach would you suggest?
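
As an illustration of the kind of detection in question, here is a naive
alternative to diffing. It assumes each page has already been reduced to
plain text with one block per line, and treats any line that occurs on most
pages as shared boilerplate. The function names and the 0.8 threshold are
arbitrary choices for the sketch, not a tested recommendation.

```python
from collections import Counter


def find_repeated_lines(pages, min_fraction=0.8):
    """Return the lines that occur on at least min_fraction of the pages."""
    counts = Counter()
    for text in pages:
        # Use a set so a line is counted at most once per page.
        counts.update({line.strip() for line in text.splitlines() if line.strip()})
    threshold = min_fraction * len(pages)
    return {line for line, n in counts.items() if n >= threshold}


def strip_repeated(text, repeated):
    """Drop every line of text whose stripped form is in the repeated set."""
    return "\n".join(l for l in text.splitlines() if l.strip() not in repeated)


# Hypothetical sample pages sharing one navigation line.
pages = [
    "Home | About\nArticle one",
    "Home | About\nArticle two",
    "Home | About\nArticle three",
]
repeated = find_repeated_lines(pages)
cleaned = [strip_repeated(p, repeated) for p in pages]
```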

Thank you,
Vlad


