hi adam and list,
i've been using rawdog quite happily for over a year, and my
config file reads 117 feeds by now.
there's one thing that keeps nagging me and that seems to be
getting worse: duplicate articles. for various sites i get duplicated
articles almost every day. for other sites i never get them.
on average i skip over circa 10-15 dupes every day on my
rawdog page.
and this annoys me very much, since i'm using rawdog to get
at the stuff faster and not to dig through yet more spam.
in fact it's annoyed me enough to investigate the reasons and so
i picked out one example. the following slashdot-article
appeared yesterday and today again on my rawdog page:
(snippet from current rdf at
http://rss.slashdot.org/Slashdot/slashdot/to)
--snip--
<item>
<title>4 Seconds Loading Time Is Maximum For
Websurfers</title>
<link>
http://rss.slashdot.org/~r/Slashdot/slashdot/to/~3/46648777/article.pl
</link>
<description>
<p><a
href="
http://rss.slashdot.org/~a/Slashdot/slashdot/to?a=U9Y6Q0"><img
src="
http://rss.slashdot.org/~a/Slashdot/slashdot/to?i=U9Y6Q0"
border="0"></img></a></p><img
src="
http://rss.slashdot.org/~r/Slashdot/slashdot/to/~4/46648777"/>
</description>
<feedburner:origLink>
http://slashdot.org/article.pl?sid=06/11/08/1352211&from=rss
</feedburner:origLink>
</item>
--snap--
please note that the article is listed only once in the
rss file, but it has obviously changed between yesterday
and today.
looking at the html generated by rawdog i see the following
differences:
1: href="
http://rss.slashdot.org/~r/Slashdot/slashdot/to/~3/46648777/article.pl"
2: href="
http://rss.slashdot.org//~r/Slashdot/slashdot/to/~3/46648777/article.pl"
1: href="
http://rss.slashdot.org/~a/Slashdot/slashdot/to?a=U9Y6Q0"
2: href="
http://rss.slashdot.org//~a/Slashdot/slashdot/to?a=vGm4Eq"
1: img src="
http://rss.slashdot.org/~a/Slashdot/slashdot/to?i=U9Y6Q0"
2: img src="
http://rss.slashdot.org//~a/Slashdot/slashdot/to?i=vGm4Eq"
slashdot provides a good example of what i'd like to call
"idiots who put ads and other dynamic stuff into rss".
obviously rawdog is not at fault here. there's neither a
unique id nor anything else to indicate that these two are
actually the same article.
still, there's so much broken RSS out in the wild that i
think it may be worth adding some heuristics to the feed
reader that help to reduce the level of spam.
an effective measure could be to store a normalized
hash over title and body of each article and use that
to sort out the dupes.
i would imagine the following normalizations,
configurable per-feed (or better yet, a simple
script-filter hook):
- normalize all urls; should catch the double-slashes
- ignore get-parameters on urls; should catch some ad-crap
- strip all <img> tags (or even all html tags?)
- strip custom regex
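
to make the list above a bit more concrete, here is a rough sketch of
such a normalized hash. it is only an illustration under my own
assumptions: the names normalize_url and article_hash are made up, it
is written for python 3, and none of it is rawdog's actual code or
plugin api.

```python
import hashlib
import re
from urllib.parse import urlsplit, urlunsplit

# crude patterns, good enough for a sketch (not a real html parser)
URL_RE = re.compile(r'https?://[^\s"\'<>]+')
IMG_RE = re.compile(r'<img\b[^>]*>(?:\s*</img>)?', re.IGNORECASE)


def normalize_url(url):
    """collapse repeated slashes in the path and drop get-parameters."""
    parts = urlsplit(url)
    path = re.sub(r'/{2,}', '/', parts.path)
    return urlunsplit((parts.scheme, parts.netloc, path, '', ''))


def article_hash(title, body, extra_patterns=()):
    """hash title and body after applying the normalizations."""
    body = IMG_RE.sub('', body)                 # strip all <img> tags
    for pattern in extra_patterns:              # strip custom regexes
        body = re.sub(pattern, '', body)
    body = URL_RE.sub(lambda m: normalize_url(m.group(0)), body)
    text = ' '.join((title + '\n' + body).split())  # collapse whitespace
    return hashlib.sha1(text.encode('utf-8')).hexdigest()
```

fed with the two versions of the slashdot item above, both bodies
should come out with the same hash, since they differ only in the
double slash and the ad parameters.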
alternatively and maybe easier: provide a per-feed config switch
along the lines of "use only title as article id". this alone would
probably already fix 99% of the offending sites.
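
the title-only switch could look roughly like this. again just a
sketch: article_id, drop_dupes and the title_only flag are
hypothetical names of mine, not existing rawdog options.

```python
import hashlib


def article_id(title, body, title_only=False):
    """return a hex id; with title_only=True the body is ignored."""
    source = title if title_only else title + '\n' + body
    normalized = ' '.join(source.split())  # collapse whitespace
    return hashlib.sha1(normalized.encode('utf-8')).hexdigest()


def drop_dupes(articles, title_only=False):
    """keep the first (title, body) pair seen for each article id."""
    seen = set()
    unique = []
    for title, body in articles:
        key = article_id(title, body, title_only)
        if key not in seen:
            seen.add(key)
            unique.append((title, body))
    return unique
```

the obvious downside is that two genuinely different articles with
the same title would be merged, which is why it should be a per-feed
setting and not a global default.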
i realize that some (all) of this stuff can possibly be
achieved with a plugin. but my python is "basic" at best,
so it's easier for me to post a little rant here than to
dive into the code and do the work myself. :-)
is anyone else having the dupe-problem?
preferably someone familiar with the rawdog code? ;-)
regards,
moe
_______________________________________________
rawdog-users mailing list
rawdog-users@???
http://lists.us-lot.org/mailman/listinfo/rawdog-users