[KinoSearch] KinoSearch & Similar/Duplicate Documents

Nathan Kurz nate at verse.com
Mon Feb 25 22:30:52 PST 2008


On 2/25/08, Vladimir Vlach <vladaman at gmail.com> wrote:
>  One of the issues we also have is not related to KinoSearch. We would
>  like to remove some parts of the page which are similar (let's say we
>  want to remove navigation menu shared on all pages). Remove the
>  content is quite easy, but how would you detect what parts are
>  repeated across pages? Diff algorithm? What kind of approach would you
>  suggest?

I was recently talking with a friend about how to do this for indexing
a blog aggregator.   For his case, a straight 'diff'-type algorithm
wasn't going to work very well because of rotating ads and page-specific
navigation.   Peter's suggestions (custom regexps) make good sense if
you have control of the pages or have a fixed number of sites which
you are scraping.
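
For reference, the naive cross-page comparison can be sketched roughly
as below (in Python, just for illustration).  It treats any line that
shows up on most pages as boilerplate.  The function names and the 0.8
cutoff are made up for this example, and it only catches blocks that
repeat exactly, which is why rotating ads and page-specific navigation
defeat it.

```python
from collections import Counter

def repeated_lines(pages, min_fraction=0.8):
    """Return the lines that occur on at least min_fraction of the pages.

    min_fraction is an arbitrary illustrative cutoff, not a tuned value.
    """
    counts = Counter()
    for text in pages:
        # Count each distinct non-empty line once per page.
        counts.update({line.strip() for line in text.splitlines() if line.strip()})
    threshold = min_fraction * len(pages)
    return {line for line, n in counts.items() if n >= threshold}

def strip_repeated(text, boilerplate):
    """Drop empty lines and lines previously identified as shared boilerplate."""
    kept = [line.strip() for line in text.splitlines()
            if line.strip() and line.strip() not in boilerplate]
    return "\n".join(kept)

# Toy pages: same menu and footer, different body.
pages = [
    "Home | About | Contact\nFirst post about search engines\nCopyright 2008",
    "Home | About | Contact\nSecond post about indexing\nCopyright 2008",
    "Home | About | Contact\nThird post about ranking\nCopyright 2008",
]
boilerplate = repeated_lines(pages)
cleaned = [strip_repeated(p, boilerplate) for p in pages]
```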

Another approach would be to do the analysis at the DOM level rather
than the text level.  There's an HTML::ContentExtractor module that
might be a good starting point for this:
<http://search.cpan.org/~jzhang/HTML-ContentExtractor/lib/HTML/ContentExtractor.pm>
It does DOM parsing, and makes simple statistical guesses about what
is real content and what is junk based on the ratio of text to
tags.   With a better (or per-site customized) algorithm for
classification, I think this has potential.
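
To make the text-to-tags idea concrete, here is a rough sketch in
Python using only the standard library.  This is not the module's
actual algorithm, just one way the heuristic could look: split the
page at block-level tags, then keep a segment only if it has enough
characters of text per inline tag.  The class name, the tag lists and
the threshold of 20 are all assumptions chosen for the example.

```python
from html.parser import HTMLParser

# Illustrative tag sets; a real classifier would need a fuller list.
BLOCK_TAGS = {"body", "div", "p", "td", "tr", "table", "ul", "ol", "li",
              "h1", "h2", "h3"}
SKIP_TAGS = {"script", "style"}

class RatioExtractor(HTMLParser):
    """Keep text segments whose characters-per-inline-tag ratio is high."""

    def __init__(self, min_ratio=20.0):
        super().__init__()
        self.min_ratio = min_ratio  # arbitrary cutoff for this sketch
        self.text = []              # text pieces of the current segment
        self.tags = 0               # inline start tags in the current segment
        self.skipping = 0           # depth inside <script>/<style>
        self.kept = []              # segments classified as content

    def flush(self):
        # Close the current segment and decide whether it is content.
        text = " ".join("".join(self.text).split())
        if text and len(text) / (1 + self.tags) >= self.min_ratio:
            self.kept.append(text)
        self.text, self.tags = [], 0

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skipping += 1
        elif tag in BLOCK_TAGS:
            self.flush()            # block boundary starts a new segment
        else:
            self.tags += 1          # inline tag such as <a> or <span>

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS:
            self.skipping = max(0, self.skipping - 1)
        elif tag in BLOCK_TAGS:
            self.flush()

    def handle_data(self, data):
        if not self.skipping:
            self.text.append(data)

def extract_content(html, min_ratio=20.0):
    parser = RatioExtractor(min_ratio)
    parser.feed(html)
    parser.close()
    parser.flush()                  # classify any trailing segment
    return parser.kept

sample = """
<html><body>
<ul><li><a href="/">Home</a></li><li><a href="/about">About</a></li></ul>
<div><p>KinoSearch is a search engine library for Perl, and this paragraph
stands in for the <a href="/doc">real content</a> of the page.</p></div>
</body></html>
"""
content = extract_content(sample)
```

On the sample page the two menu entries score only a few characters
per tag and are dropped, while the paragraph is kept.  A per-site
version would mostly come down to tuning the threshold and the tag
lists.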

For my friend, it looked like http://dapper.net might be useful as
well.  Dapper is a web service that lets you create customized RSS
feeds of sites based on graphically entered parameters.  It's probably
not going to work for your needs, but might be worth checking out for
ideas.

Good luck!

Nathan Kurz
nate at verse.com


