Getting full contents for partial feeds

In my opinion partial feeds are not feeds. While I understand the need to get pageviews, I don’t like it. My time is valuable and I don’t want to hop between Google Reader and other browser windows to read the content. Disclaimer: this method might or might not be a violation of some laws, TOS, etc. IANAL. Use this method at your own risk.

The method: use Yahoo! Pipes to fetch the HTML page for each entry. The setup can be seen below:

The feed used in the example is the old feed from the Trusted Source blog (since they’ve been bought by McAfee, they publish a new feed with the complete posts ;-)). The first operator (the Loop) fetches the page that each item’s link points to. Two remarks:

site owners can prohibit Yahoo Pipes from fetching pages using robots.txt
pages are not fetched at each evaluation of the pipe, rather at each change of the source feed (for those who are worried about a pipe DDoS-ing the site)
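
The two remarks can be sketched in Python. This is only an illustration and not how Yahoo Pipes works internally, which is not documented here: the user agent name, the example URLs and the "already fetched" set are all assumptions made for the example.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical user agent name, used only for this illustration.
USER_AGENT = "ExampleFeedFetcher"


def build_robots(robots_txt: str) -> RobotFileParser:
    """Parse the text of a robots.txt file without touching the network."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def links_to_fetch(links, robots, already_fetched):
    """Return the links that robots.txt allows and that were not fetched before."""
    return [
        link
        for link in links
        if link not in already_fetched and robots.can_fetch(USER_AGENT, link)
    ]


# Example data (made up): one old post, one new post, one disallowed page.
robots = build_robots("User-agent: *\nDisallow: /private/\n")
already_fetched = {"http://example.com/post-1"}
feed_links = [
    "http://example.com/post-1",
    "http://example.com/post-2",
    "http://example.com/private/draft",
]
print(links_to_fetch(feed_links, robots, already_fetched))
```

Only the new, permitted link is left over, so an unchanged feed causes no page requests at all.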

Set the “Cut content from” and “to” fields so that you obtain the part of the HTML you want. “Split using delimiter” must be set to something, preferably a string which doesn’t occur in the page, so that the content is not split into pieces. I just used some random MD5.
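
For readers who prefer code to screenshots, here is a rough Python equivalent of the loop plus the “Cut content from / to” setting. It is a sketch under assumptions, not the Pipes implementation: the marker strings, the entry dictionaries and the `fetch` callable are invented for the example (in real use `fetch` would download the page).

```python
from typing import Callable, Dict, List


def cut_between(page: str, start: str, end: str) -> str:
    """Return the text between the first `start` marker and the next `end` marker.

    Returns an empty string when either marker is missing.
    """
    begin = page.find(start)
    if begin == -1:
        return ""
    begin += len(start)
    stop = page.find(end, begin)
    if stop == -1:
        return ""
    return page[begin:stop]


def expand_entries(
    entries: List[Dict[str, str]],
    fetch: Callable[[str], str],
    start: str,
    end: str,
) -> List[Dict[str, str]]:
    """Replace each entry's description with the cut-out part of its linked page."""
    expanded = []
    for entry in entries:
        body = cut_between(fetch(entry["link"]), start, end)
        # Keep the original teaser when the markers are not found.
        expanded.append({**entry, "description": body or entry.get("description", "")})
    return expanded


# Stand-in for the network: a dictionary of made-up pages.
pages = {
    "http://example.com/a": '<html><div class="post"><p>Full text</p><!-- end post --></html>',
}
feed = [{"title": "A", "link": "http://example.com/a", "description": "Teaser"}]
full = expand_entries(feed, pages.__getitem__, '<div class="post">', "<!-- end post -->")
print(full[0]["description"])
```

Passing the fetch function in as a parameter keeps the cutting logic testable without hitting the site.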

The second loop tries to protect against XSS-ing yourself :-). I discovered this by accident, because the feed contained the following post: A Little Filtering Can Halt Some XSS Attacks. The problem is that the inserted HTML content gets decoded twice, resulting in execution of the script, even if it was encoded properly for the HTML page. The method used in the above example is rather lame, however there is good news: Google Reader disallows JavaScript, so you are not at risk even without this transformation.
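
To make the double-decoding problem concrete, here is a small Python illustration. It is not the transformation used in the pipe; `protect_entities` is a hypothetical helper that shows one possible countermeasure, namely escaping the ampersand of every character reference so that one extra decoding step only gets you back to the safe text.

```python
import html
import re

# Matches the "&" that starts a character reference such as &lt; or &#60;
_ENTITY_AMP = re.compile(r"&(?=#?\w+;)")


def protect_entities(fragment: str) -> str:
    """Escape existing character references once more, leaving real tags alone."""
    return _ENTITY_AMP.sub("&amp;", fragment)


# Properly encoded for an HTML page: the script is shown as text, not executed.
page_fragment = "<p>Filter &lt;script&gt;alert(1)&lt;/script&gt; early</p>"

# One extra decoding step turns the harmless text into a live script tag.
decoded_again = html.unescape(page_fragment)
print(decoded_again)

# With the extra escaping, the same decoding step only restores the original.
protected = protect_entities(page_fragment)
print(html.unescape(protected) == page_fragment)
```

The real markup (the `<p>` tags) is untouched, only the encoded parts get the additional layer.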

Enjoy your full feeds.

Update: Originally I came up with the idea while reading the article Build a Web Page Monitor with Google Docs and Track Changes Automatically, however the Yahoo Pipes solution is a much cleaner, task-oriented solution (but the Google Docs one is still worth checking out for other possible use cases).

Update: two alternative solutions (which are easier to use than creating a custom pipe for every feed) – via taint.org:

February 13, 2009

gpanther

blog, hack, rss

3 responses to “Getting full contents for partial feeds”

Darío says:

July 29, 2009 at 6:44 pm

Great article. Do you know how I can parse an entire Google result page? Yahoo Pipes claims "Can't fetch pages that robots.txt disallow".

Thanks

Cd-MaN says:

July 30, 2009 at 7:19 am

@Dario: yes, Yahoo Pipes respects the robots.txt file, and it is very good that it does. As for getting back search results: I would recommend the Google Search AJAX API, since it is officially supported and it returns the search results in a machine-readable format (JSON), which saves you the hassle of writing a parser and living in fear that the page format changes and your parser breaks.

The disadvantage is that you must (as per the TOS) use the results on a webpage (i.e. you can't do mass requests for back-office processing).

Dario says:

August 7, 2009 at 8:27 am

Thanks for the reply. I'm developing an app to get the ads from the Google result page, so it isn't possible at all with Pipes or the AJAX API. I will continue making some multi-server requests 😉

