[KinoSearch] KinoSearch & Similar/Duplicate Documents

Marvin Humphrey marvin at rectangular.com
Mon Feb 25 06:49:23 PST 2008


On Feb 25, 2008, at 3:00 AM, Vladimir Vlach wrote:

> I wonder if you could suggest me a way how to retrieve
> Similar documents and Duplicates. We index few web-sites and sometimes
> the documents are posted with different URLs. How to solve this?

Off the top of my head, I don't know of an easy or reliable approach.   
I'm sure that there is academic research out there on the subject.

Brainstorming...

This is a two-stage problem.  The hard part is identifying candidates  
which may be similar to each other.  After you have candidates, then  
you can roll through the seemingly matching docs and see what kind of  
matching content is really there.  Is it boilerplate template code  
(e.g. nav menus) that ought to be discarded?  Or is this truly  
meaningful content which has been duplicated in multiple locations?

Say you were to build a pure vector space search engine, as described  
at <http://www.perl.com/pub/a/2003/02/19/engine.html>.  Then you  
perform a search using the entire contents of one document as a  
query.  Documents with duplicate content will appear nearly on top of  
each other in vector space.

An uncompressed vector space search engine is not feasible for large  
document collections;   however, I suspect that a decomposed vector  
engine a la LSA (latent semantic analysis) would do a good job at  
picking candidates.  An excellent introduction to LSA is available at <http://www.knowledgesearch.org/lsi/cover_page.htm 
 >.  (I've started collecting these links on a wiki page at <http://www.rectangular.com/kinosearch/wiki/VectorSpaceModel 
 >.)

The patent on Latent Semantic Analysis expires this year.  It ought to  
be possible to extend KinoSearch with a KSx::LSA distro, which would  
include KSx::LSA::LSAWriter, KSx::LSA::LSAQuery and so on.

> One of the issues we also have is not related to KinoSearch. We would
> like to remove some parts of the page which are similar (let's say we
> want to remove navigation menu shared on all pages). Remove the
> content is quite easy, but how would you detect what parts are
> repeated across pages? Diff algorithm? What kind of approach would you
> suggest?

I haven't studied this one in depth; from what I understand it's quite  
a difficult problem.  (I vaguely recall a discussion in some Lucene  
forum where Andrzej Bialecki, one of Lucene's biggest contributors,  
threw up his hands.)  Especially annoying is template code which  
varies subtly, making verification of suspected boilerplate a  
challenging prospect.  I can think of some vector-based techniques I  
might try, but hunting down academic research on the topic is likely  
to be more fruitful.

Best,

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/




More information about the KinoSearch mailing list