[KinoSearch] KinoSearch & Similar/Duplicate Documents
Vladimir Vlach
vladaman at gmail.com
Mon Feb 25 03:00:43 PST 2008
Hello!
I love using KinoSearch. So far it does everything we need for our
project. Could you suggest a way to retrieve similar and duplicate
documents? We index a few websites, and sometimes the same document
is posted under different URLs. How can we solve this?
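To make the question concrete, here is a minimal sketch of one generic
approach, independent of KinoSearch and not something its API is known
to provide: hash the normalized text to catch exact duplicates, and
compare sets of overlapping word sequences ("shingles") with Jaccard
similarity to catch near-duplicates. The function names, the shingle
size k=3, and any cutoff score are assumptions for illustration only.

```python
import hashlib
import re


def normalize(text):
    """Lowercase and collapse whitespace so trivial differences vanish."""
    return re.sub(r"\s+", " ", text.lower()).strip()


def fingerprint(text):
    """Hash of the normalized text; equal hashes suggest exact duplicates."""
    return hashlib.sha1(normalize(text).encode("utf-8")).hexdigest()


def shingles(text, k=3):
    """Return the set of overlapping k-word sequences in the text."""
    words = normalize(text).split()
    if len(words) < k:
        # Short texts yield one shingle holding all words (or none if empty).
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b, k=3):
    """Jaccard similarity of the two texts' shingle sets, from 0.0 to 1.0."""
    sa, sb = shingles(a, k), shingles(b, k)
    if not sa and not sb:
        return 1.0  # two empty texts are treated as identical
    return len(sa & sb) / len(sa | sb)


if __name__ == "__main__":
    doc_a = "The quick brown fox jumps over the lazy dog"
    doc_b = "The quick brown fox jumps over the lazy cat"
    print(fingerprint(doc_a) == fingerprint(doc_b))
    print(jaccard(doc_a, doc_b))
```

Documents with the same fingerprint could be collapsed at index time,
while pairs scoring above some chosen Jaccard cutoff could be reported
as "similar". Comparing every pair does not scale to a large index, so
this is only meant to show the idea.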
Another issue we have is not related to KinoSearch. We would like to
remove the parts of a page that repeat across pages (for example, a
navigation menu shared by all pages). Removing the content is quite
easy, but how would you detect which parts are repeated across pages?
A diff algorithm? What kind of approach would you suggest?
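As a rough illustration of one alternative to a pairwise diff, the
sketch below counts in how many pages each text line occurs and treats
lines present in most pages as shared boilerplate. It assumes the
pages have already been converted to plain text with one block per
line; the function names and the 0.8 threshold are hypothetical
choices, not a recommendation from the list.

```python
from collections import Counter


def find_boilerplate(pages, threshold=0.8):
    """Return the lines that occur in at least `threshold` of the pages."""
    counts = Counter()
    for page in pages:
        # Use a set so a line repeated within one page is counted once.
        counts.update(
            {line.strip() for line in page.splitlines() if line.strip()}
        )
    min_pages = threshold * len(pages)
    return {line for line, n in counts.items() if n >= min_pages}


def strip_boilerplate(page, boilerplate):
    """Drop every line of the page that was flagged as boilerplate."""
    return "\n".join(
        line for line in page.splitlines() if line.strip() not in boilerplate
    )


if __name__ == "__main__":
    pages = [
        "Home | About\nArticle one",
        "Home | About\nArticle two",
        "Home | About\nArticle three",
    ]
    shared = find_boilerplate(pages)
    print(shared)
    print(strip_boilerplate(pages[0], shared))
```

Counting whole lines is crude: a menu that highlights the current page
differs slightly from page to page and would be missed, which is where
a diff-style or fuzzy comparison might still be needed.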
Thank you,
Vlad