[KinoSearch] KinoSearch & Similar/Duplicate Documents
Peter Karman
peter at peknet.com
Mon Feb 25 11:56:20 PST 2008
Vladimir Vlach wrote on 2/25/08 5:00 AM:
> Hello !
>
> I love using KinoSearch. So far it's doing everything we need for our
> project. I wonder if you could suggest a way to retrieve
> similar documents and duplicates. We index a few web sites, and sometimes
> the same document is posted under different URLs. How can we solve this?
>
Exact duplicates can be identified simply by MD5-ing the doc content and
skipping any doc whose digest you've already seen. That's what Swish-e's
spider.pl does.
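A minimal sketch of the MD5 idea, in Python rather than Perl. The function
names and the (url, text) input shape are made up for illustration; this is
not spider.pl's code. Note that it only catches byte-identical content, so a
single changed character yields a different digest.

```python
import hashlib


def content_digest(text):
    """Return the hex MD5 digest of a document body."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()


def drop_exact_duplicates(docs):
    """Yield (url, text) pairs, skipping any body already seen.

    `docs` is any iterable of (url, text) pairs. The first URL seen
    for a given body is the one that is kept.
    """
    seen = set()
    for url, text in docs:
        digest = content_digest(text)
        if digest in seen:
            continue  # same content under a different URL
        seen.add(digest)
        yield url, text


# Hypothetical input: the same page reachable under two URLs.
docs = [
    ("http://example.com/a", "Hello world"),
    ("http://example.com/b?ref=1", "Hello world"),
    ("http://example.com/c", "Something else"),
]
unique = list(drop_exact_duplicates(docs))
```

Normalizing whitespace or stripping markup before hashing would let the
digest match pages that differ only in formatting.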
Similarity is a much tougher nut. LSA (latent semantic analysis) is a decent
approach (as Marvin suggested). One Swish-e user tried this:
http://swish-e.org/archive/2005-02/8967.html
The key imo is to avoid indexing duplicate and for-some-value-of-similar
documents in the first place. Implement these features at the document
aggregator level, before handing them to KS.
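One possible way to filter "for-some-value-of-similar" docs at the aggregator
is sketched below. To be clear, this is not LSA: it swaps in a much simpler
technique, comparing sets of overlapping word triples ("shingles") with
Jaccard similarity. The function names, the shingle size, and the 0.8
threshold are all assumptions to tune against your own data.

```python
import re


def shingles(text, size=3):
    """Return the set of overlapping word n-grams in `text`."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < size:
        # Too short to shingle: treat the whole text as one unit.
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + size]) for i in range(len(words) - size + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets: |a & b| / |a | b|."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def filter_near_duplicates(docs, threshold=0.8):
    """Keep each (url, text) pair unless it is too similar to a kept one."""
    kept = []  # (url, text, shingle_set) for every doc kept so far
    for url, text in docs:
        sh = shingles(text)
        if any(jaccard(sh, other) >= threshold for _, _, other in kept):
            continue
        kept.append((url, text, sh))
    return [(url, text) for url, text, _ in kept]
```

Each new doc is compared against every doc already kept, so this is quadratic
in the number of docs. That is fine for a few sites but would need something
smarter for a large crawl.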
> One of the issues we also have is not related to KinoSearch. We would
> like to remove some parts of the page which are similar (let's say we
> want to remove the navigation menu shared by all pages). Removing the
> content is quite easy, but how would you detect which parts are
> repeated across pages? A diff algorithm? What kind of approach would you
> suggest?
If you have control over the content, you might add <!-- noindex --> comments
around the stuff you want excluded, and then s/// that out before you pass it
to KS.
If you don't have control, and the improvement is worth your time, consider
identifying some text patterns in your documents and just s/// those out in
the same way.
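A rough sketch of both substitutions, using Python's re module in place of
Perl's s///. The closing <!-- /noindex --> marker is an assumption (use
whatever pair you actually put in your pages), and the function names are
made up for illustration.

```python
import re

# Assumed marker pair: <!-- noindex --> ... <!-- /noindex -->
# Non-greedy so each marked section is removed separately.
NOINDEX_RE = re.compile(
    r"<!--\s*noindex\s*-->.*?<!--\s*/noindex\s*-->",
    re.DOTALL | re.IGNORECASE,
)


def strip_noindex(html):
    """Remove everything between the noindex markers, markers included."""
    return NOINDEX_RE.sub("", html)


def strip_patterns(html, patterns):
    """Remove each hand-picked regex pattern from pages you don't control."""
    for pattern in patterns:
        html = re.sub(pattern, "", html, flags=re.DOTALL)
    return html


page = (
    '<body><!-- noindex --><ul id="nav"><li>Home</li></ul>'
    "<!-- /noindex --><p>Real content</p></body>"
)
cleaned = strip_noindex(page)
```

Regexes over HTML are brittle (a nested div will defeat a simple pattern), so
it is worth spot-checking the output on a handful of real pages.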
--
Peter Karman . http://peknet.com/ . peter at peknet.com