Re: how to avoid dupes?

Author: Andy Heath
Date:  
To: rawdog-users
Subject: Re: how to avoid dupes?
I don't have an answer to this, but I do have the same
problem (plus a couple of other things I posted about).
I suspect Adam may be away or not seeing these emails?

andy

> hi adam and list,
>
> i've been using rawdog quite happily for over a year and my
> config file lists 117 feeds by now.
>
> there's one thing that keeps nagging me and that seems to be
> getting worse: duplicate articles. for various sites i get duplicated
> articles almost every day. for other sites i never get them.
> on average i skip over circa 10-15 dupes every day on my
> rawdog page.
>
> and this annoys me very much, since i'm using rawdog to get
> at the stuff faster and not to dig through yet more spam.
>
> in fact it's annoyed me enough to investigate the reasons and so
> i picked out one example. the following slashdot-article
> appeared yesterday and today again on my rawdog page:
>
> (snippet from current rdf at http://rss.slashdot.org/Slashdot/slashdot/to)
>
> --snip--
>
> <item>
> <title>4 Seconds Loading Time Is Maximum For
> Websurfers</title>
>
> <link>
> http://rss.slashdot.org/~r/Slashdot/slashdot/to/~3/46648777/article.pl
> </link>
>
> <description>
> <p><a
> href="http://rss.slashdot.org/~a/Slashdot/slashdot/to?a=U9Y6Q0"><img
> src="http://rss.slashdot.org/~a/Slashdot/slashdot/to?i=U9Y6Q0"
> border="0"></img></a></p><img
> src="http://rss.slashdot.org/~r/Slashdot/slashdot/to/~4/46648777"/>
> </description>
>
> <feedburner:origLink>
> http://slashdot.org/article.pl?sid=06/11/08/1352211&from=rss
> </feedburner:origLink>
> </item>
>
> --snap--
>
> please note that the article is listed only once in the
> rss file but it has obviously changed between yesterday
> and today.
>
> looking at the html generated by rawdog i see the following
> differences:
>
> 1: href="http://rss.slashdot.org/~r/Slashdot/slashdot/to/~3/46648777/article.pl"
> 2: href="http://rss.slashdot.org//~r/Slashdot/slashdot/to/~3/46648777/article.pl"
>
> 1: href="http://rss.slashdot.org/~a/Slashdot/slashdot/to?a=U9Y6Q0"
> 2: href="http://rss.slashdot.org//~a/Slashdot/slashdot/to?a=vGm4Eq"
>
> 1: img src="http://rss.slashdot.org/~a/Slashdot/slashdot/to?i=U9Y6Q0"
> 2: img src="http://rss.slashdot.org//~a/Slashdot/slashdot/to?i=vGm4Eq"
>
> slashdot provides a good example of what i'd like to call
> "idiots who put ads and other dynamic stuff into rss".
>
> obviously rawdog is not at fault here. there's neither a
> unique id nor anything else to indicate that these two are
> actually the same article.
>
> still, there's so much broken RSS out in the wild that i
> think it may be worth adding some heuristics to the feed
> reader that help to reduce the level of spam?
>
> an effective measure could be to store a normalized
> hash over title and body of each article and use that
> to sort out the dupes.
>
> i would imagine the following normalizations,
> configurable per-feed (or better yet, a simple
> script-filter hook):
>
> - normalize all urls; should catch the double-slashes
> - ignore get-parameters on urls; should catch some ad-crap
> - strip all <img> tags (or even all html tags?)
> - strip anything matching a custom regex
>
> alternatively and maybe easier: provide a per-feed config switch
> along the lines of "use only title as article id". this alone would
> probably already fix 99% of the offending sites.
>
> i realize that some (all) of this stuff can possibly be
> achieved with a plugin. but my python is "basic" at best,
> so it's easier for me to post a little rant here than to
> dive into the code and do the work myself. :-)
>
> is anyone else having the dupe-problem?
> preferably someone familiar with the rawdog code? ;-)
>
>
> regards,
> moe
>
>
> _______________________________________________
> rawdog-users mailing list
> rawdog-users@???
> http://lists.us-lot.org/mailman/listinfo/rawdog-users
>
>
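For anyone who wants to experiment, the normalized-hash idea from the quoted message could be sketched roughly as follows. This is plain Python and not a rawdog plugin or anything from rawdog's API; the names normalize_url, normalize_body and article_key are made up for illustration, and it assumes the feed parser has already handed you the title and body as strings.

```python
import hashlib
import re
from urllib.parse import urlsplit, urlunsplit

# Hypothetical helpers for illustration only; none of this is rawdog code.
URL_RE = re.compile(r'https?://[^\s"\'<>]+')
IMG_RE = re.compile(r'<img\b[^>]*>(?:\s*</img>)?', re.IGNORECASE)


def normalize_url(url):
    """Collapse repeated slashes in the path and drop GET parameters."""
    parts = urlsplit(url)
    path = re.sub(r'/{2,}', '/', parts.path)
    # Empty query and fragment: everything after '?' or '#' is ignored.
    return urlunsplit((parts.scheme, parts.netloc.lower(), path, '', ''))


def normalize_body(html, extra_patterns=()):
    """Strip <img> tags and custom regexes, then normalize embedded URLs."""
    text = IMG_RE.sub('', html)
    for pattern in extra_patterns:
        text = re.sub(pattern, '', text)
    text = URL_RE.sub(lambda m: normalize_url(m.group(0)), text)
    # Collapse whitespace so reflowed markup hashes the same.
    return ' '.join(text.split())


def article_key(title, body, title_only=False, extra_patterns=()):
    """Return a hex digest that identifies an article for dupe detection."""
    parts = [' '.join(title.split())]
    if not title_only:
        parts.append(normalize_body(body, extra_patterns))
    return hashlib.sha1('\n'.join(parts).encode('utf-8')).hexdigest()
```

Two fetches of the same item would then compare equal if article_key() returns the same digest for both, and title_only=True corresponds to the "use only title as article id" switch. One caveat: dropping GET parameters also merges links that really do differ only in the query string (the slashdot origLink above uses ?sid=...), so this is probably something to enable per feed rather than globally.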


