[KinoSearch] KinoSearch & Similar/Duplicate Documents

Peter Karman peter at peknet.com
Mon Feb 25 11:56:20 PST 2008



Vladimir Vlach wrote on 2/25/08 5:00 AM:
> Hello !
> 
> I love using KinoSearch. So far it's doing everything we need for our
> project. I wonder if you could suggest a way to retrieve
> similar documents and duplicates. We index a few web sites, and sometimes
> the same document is posted under different URLs. How can we solve this?
> 

Exact duplicates can be identified simply by MD5-ing the doc content and 
skipping any doc whose digest you have already seen. That's what Swish-e's 
spider.pl does.
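
A minimal sketch of the MD5 idea, written in Python purely for illustration. 
It is not spider.pl's code, and the names content_digest and 
drop_exact_duplicates are made up for this example. It assumes you already 
have (url, text) pairs from your crawler and that "duplicate" means 
byte-for-byte identical content:

```python
import hashlib


def content_digest(text: str) -> str:
    """Return the MD5 hex digest of a document's content."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()


def drop_exact_duplicates(docs):
    """Yield (url, text) pairs, skipping content whose digest was already seen.

    `docs` is any iterable of (url, text) pairs. The first URL seen for a
    given piece of content wins; later copies are dropped.
    """
    seen = set()
    for url, text in docs:
        digest = content_digest(text)
        if digest in seen:
            continue  # same content already seen under another URL
        seen.add(digest)
        yield url, text
```

Note that a single differing byte (a timestamp, a session id in a link) 
changes the digest, so this only catches exact copies.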

Similarity is a much tougher nut. LSA (latent semantic analysis) is a decent 
approach (as Marvin suggested). One Swish-e user tried this:

http://swish-e.org/archive/2005-02/8967.html

The key imo is to avoid indexing duplicate and for-some-value-of-similar 
documents in the first place. Implement these features at the document 
aggregator level, before handing them to KS.


> One of the issues we also have is not related to KinoSearch. We would
> like to remove some parts of the page which are similar (let's say we
> want to remove the navigation menu shared by all pages). Removing the
> content is quite easy, but how would you detect which parts are
> repeated across pages? A diff algorithm? What kind of approach would you
> suggest?

If you have control over the content, you might add <!-- noindex --> comments 
around the stuff you want excluded, and then s/// that out before you pass the 
doc to KS.
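
A rough sketch of that substitution, again in Python for illustration only. 
The closing marker <!-- /noindex --> is an assumption made for this example; 
use whatever pair of markers you actually put in your pages. strip_noindex is 
a hypothetical helper name:

```python
import re

# Assumed marker pair: <!-- noindex --> ... <!-- /noindex -->
# Non-greedy match so each marked region is removed separately;
# DOTALL lets a region span multiple lines.
NOINDEX_RE = re.compile(
    r"<!--\s*noindex\s*-->.*?<!--\s*/noindex\s*-->",
    re.DOTALL | re.IGNORECASE,
)


def strip_noindex(html: str) -> str:
    """Remove every marked region from the page before indexing it."""
    return NOINDEX_RE.sub("", html)
```

An opening marker with no matching close is left untouched, so a forgotten 
closing comment will not silently eat the rest of the page.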

If you don't have control, and the improvement is worth your time, consider 
identifying some text patterns in your documents and just s/// those out in 
the same way.

-- 
Peter Karman  .  http://peknet.com/  .  peter at peknet.com


