[KinoSearch] KinoSearch & Similar/Duplicate Documents
Marvin Humphrey
marvin at rectangular.com
Mon Feb 25 06:49:23 PST 2008
On Feb 25, 2008, at 3:00 AM, Vladimir Vlach wrote:
> I wonder if you could suggest me a way how to retrieve
> Similar documents and Duplicates. We index few web-sites and sometimes
> the documents are posted with different URLs. How to solve this?
Off the top of my head, I don't know of an easy or reliable approach.
I'm sure that there is academic research out there on the subject.
Brainstorming...
This is a two-stage problem. The hard part is identifying candidates
which may be similar to each other. After you have candidates, then
you can roll through the seemingly matching docs and see what kind of
matching content is really there. Is it boilerplate template code
(e.g. nav menus) that ought to be discarded? Or is this truly
meaningful content which has been duplicated in multiple locations?
Say you were to build a pure vector space search engine, as described
at <http://www.perl.com/pub/a/2003/02/19/engine.html>. Then you
perform a search using the entire contents of one document as a
query. Documents with duplicate content will appear nearly on top of
each other in vector space.
An uncompressed vector space search engine is not feasible for large
document collections; however, I suspect that a decomposed vector
engine a la LSA (latent semantic analysis) would do a good job at
picking candidates. An excellent introduction to LSA is available at <http://www.knowledgesearch.org/lsi/cover_page.htm
>. (I've started collecting these links on a wiki page at <http://www.rectangular.com/kinosearch/wiki/VectorSpaceModel
>.)
The patent on Latent Semantic Analysis expires this year. It ought to
be possible to extend KinoSearch with a KSx::LSA distro, which would
include KSx::LSA::LSAWriter, KSx::LSA::LSAQuery and so on.
> One of the issues we also have is not related to KinoSearch. We would
> like to remove some parts of the page which are similar (let's say we
> want to remove navigation menu shared on all pages). Remove the
> content is quite easy, but how would you detect what parts are
> repeated across pages? Diff algorithm? What kind of approach would you
> suggest?
I haven't studied this one in depth; from what I understand it's quite
a difficult problem. (I vaguely recall a discussion in some Lucene
forum where Andrzej Bialecki, one of Lucene's biggest contributors,
threw up his hands.) Especially annoying is template code which
varies subtly, making verification of suspected boilerplate a
challenging prospect. I can think of some vector-based techniques I
might try, but hunting down academic research on the topic is likely
to be more fruitful.
Best,
Marvin Humphrey
Rectangular Research
http://www.rectangular.com/
More information about the KinoSearch
mailing list