Every end is a new beginning


TL;DR: I’m shutting down the twitfeeder project (it had been on life support for a long time), so I’m mirroring a technical article from its blog in the hope that it might be useful to somebody someday.

Proxying URL fetch requests from the Google App Engine

Hosting on the Google App Engine means giving up a lot of control: your application will run on some machines in one of Google’s datacenters, use other machines to store data, and use yet another set of machines to talk to the outside world (send/receive emails, fetch URLs, etc.). You don’t know the exact identity of these machines (which for this article means their external IP addresses), and even though you can find out some of the details (for example by fetching the ipchicken page and parsing the result, or by sending an e-mail and looking at the headers; a sketch of this probing approach follows the list below), there is no guarantee that the results are repeatable (i.e. that doing it again and again will yield the same result) in the short or the long run. While one might not care about such details most of the time, there are some corner cases where it would be nice to have a predictable and unique source address:

  • You might worry that some of the exit points get blocked because other applications on the Google App Engine behave abusively
  • You might want a single exit point so that you can “debug” your traffic at a low (network / tcpdump) level
  • And finally, the main reason the Twit Feeder uses it: some third-party services (like Twitter or Delicious) use the source IP to whitelist requests to their API
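
For illustration, here is a minimal sketch (in the Python GAE runtime) of the probing approach mentioned above; httpbin.org/ip is just one example, any IP-echo service would do:

from google.appengine.api import urlfetch

def current_exit_ip():
    # httpbin.org/ip echoes the caller's source address;
    # any "what is my IP" service works the same way
    result = urlfetch.fetch('http://httpbin.org/ip', deadline=10)
    return result.content

# calling this repeatedly may return different addresses over time:
# there is no guarantee that the exit IP stays stable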

To be fair to the GAE architects: the above considerations don’t affect the main use case and are more of a corner case if we look at the average application running on the GAE. Also, having so few commitments (i.e. they don’t stipulate things like “all the URL fetches will come from the 10.0.0.0/24 netblock”) means that they are free to move things around (even between datacenters) to optimize uptime and performance, which in turn benefits the application owners and users.

Back to our scenario: the solution is to introduce an intermediary which has a static, well-known IP and let it do all the fetching. Ideally I would have installed Squid and been done with it, but the URL fetch service currently doesn’t support proxies. So the solution I came up with looks like this:

  • Get a VPS server. I would recommend one which gives you more bandwidth rather than more CPU / memory / disk. It is also a good idea to get one in the USA to minimize latency from/to the Google datacenters. I’m currently using VPSLink and I haven’t had any problems (full disclosure: that is a referral link, and you should get 10% off for life if you use it).
  • Install Apache + PHP on it
  • Use a simple PHP script to fetch the page encoded in a query parameter, using the php_curl extension.

To spice up this blog post ;-), here is a diagram of the process:


A couple of points:

  • Taking care of a VPS can be challenging, especially if you aren’t a Linux user. However, failing to do so can result in it being taken over by malicious individuals. Consider getting a managed VPS. Also, three quick security tips: use strong passwords, move the SSH server to a non-standard port, and configure your firewall to be as restrictive as possible (deny everything, and open ports to only a limited number of IP addresses).
  • Yes, this introduces a single point of failure into your application, so plan accordingly (while your application is small this shouldn’t be a problem; as it grows you can, for example, get two VPSes in two different data centers for added reliability; see the failover sketch after this list).
  • The traffic between Google and your VPS can be encrypted using TLS (HTTPS). The good news is that the URL fetcher service doesn’t check the certificates, so you can use self-signed ones. The bad news is that the URL fetcher doesn’t check the certificates, so a determined attacker can relatively easily man-in-the-middle you (but it still protects the data from the casual sniffer).
  • Be aware that you need to budget for double the amount of traffic you estimate at the VPS (because it first needs to download the data and then upload it back to Google). The URL fetcher service does know how to use gzip compression, so if you are downloading mainly text, you shouldn’t have a bandwidth problem.
  • PHP might seem like an unusual choice of language, given that most GAE users have experience in either Python or Java, but there are a lot of tutorials out there on how to install it (on a modern Linux distribution it can be done in under 10 minutes with the help of the package manager), and it was the one I was most comfortable with as an Apache module.
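
Regarding the single point of failure: here is a minimal failover sketch, assuming two proxy instances at hypothetical addresses; the GAE side simply tries them in order:

from google.appengine.api import urlfetch

# hypothetical addresses of two proxy VPSes in different data centers
PROXIES = ['http://proxy-a.example.com:1483/fetch_url.php',
           'http://proxy-b.example.com:1483/fetch_url.php']

def fetch_via_any_proxy(encoded_url, headers):
    last_error = None
    for proxy in PROXIES:
        try:
            return urlfetch.fetch(url='%s?q=%s' % (proxy, encoded_url),
                                  headers=headers, deadline=30)
        except urlfetch.DownloadError, e:
            # network-level failure; fall through to the next proxy
            last_error = e
    raise last_error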

Without further ado, here are the relevant source code snippets (which I hereby release into the public domain):

The PHP script:

<?php
// suppress error output so that failures don't leak server internals
error_reporting(0);
require 'privatedata.php';

// reject requests that don't carry the correct shared-secret header
if ($AUTH != @$_SERVER['HTTP_X_SHARED_SECRET']) {
   header("HTTP/1.0 404 Not Found");
   exit;
}

// the URL to fetch is passed in the "q" query parameter
$url = @$_GET['q'];
if (!isset($url)) {
   header("HTTP/1.0 404 Not Found");
   exit;
}

$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $url);
// an empty string enables every encoding curl supports (gzip, deflate)
// and makes curl decode the response body transparently
curl_setopt($ch, CURLOPT_ENCODING, '');
curl_setopt($ch, CURLOPT_TIMEOUT, 30);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
// don't follow redirects; let the App Engine caller see them
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, false);

// forward the caller's user agent and HTTP basic auth credentials, if any
if (isset($_SERVER['HTTP_USER_AGENT'])) {
   curl_setopt($ch, CURLOPT_USERAGENT, $_SERVER['HTTP_USER_AGENT']);
}
if (isset($_SERVER['PHP_AUTH_USER']) && isset($_SERVER['PHP_AUTH_PW'])) {
   curl_setopt($ch, CURLOPT_USERPWD, "{$_SERVER['PHP_AUTH_USER']}:{$_SERVER['PHP_AUTH_PW']}");
}

// forward any POST data, re-encoded as a standard form body
if (count($_POST) > 0) {
   curl_setopt($ch, CURLOPT_POST, true);
   $postfields = array();
   foreach ($_POST as $key => $val) {
       $postfields[] = urlencode($key) . '=' . urlencode($val);
   }
   curl_setopt($ch, CURLOPT_POSTFIELDS, implode('&', $postfields));
}

$headers = array();
function header_callback($ch, $header) {
   global $headers;
   $trimmed = trim($header);
   // skip the blank line that terminates the header block
   if ($trimmed === '') {
       return strlen($header);
   }
   // we add our own content encoding, and the content length (and
   // transfer encoding) might vary because of this, so drop them
   if (false === stripos($trimmed, 'Content-Encoding:')
       && false === stripos($trimmed, 'Content-Length:')
       && false === stripos($trimmed, 'Transfer-Encoding:')) {
       $headers[] = $trimmed;
   }
   return strlen($header);
}

curl_setopt($ch, CURLOPT_HEADERFUNCTION, 'header_callback');

$output = curl_exec($ch);
if (false === $output) {
   header("HTTP/1.0 500 Server Error");
   print curl_error($ch);
   exit;
}
curl_close($ch);

// replay the (filtered) upstream status line and headers, then the body
foreach ($headers as $header) {
   header($header);
}
print $output;


The “privatedata.php” file looks like this:

<?php

$AUTH = '23tqhwfgj2qbherhjerj';


Two separate files are used to avoid committing the password to the source repository, while still keeping all the source code open.
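
The secret itself can be any hard-to-guess string; one quick way to generate one (a sketch, any equivalent method is fine) is to hex-encode a few random bytes:

import binascii
import os

# 16 random bytes, hex-encoded: a 32-character shared secret
print(binascii.hexlify(os.urandom(16)))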

Now with the code in place, you can test it using curl:

curl --compressed -v -H 'X-Shared-Secret: 23tqhwfgj2qbherhjerj' 'http://127.0.0.1:1483/fetch_url.php?q=http%3A//identi.ca/api/help/test.json'


As you can see, a custom header is used for authentication. Another security measure is to use a non-standard port. Limiting requests to the IPs of the Google datacenters in the firewall would be the ideal solution, but given that this is the problem we are trying to solve in the first place (the Google datacenters not having an officially announced set of IP addresses), it doesn’t seem possible.

Finally, here is the Python code to use the script from within an application hosted on the GAE:

from urllib import quote
from google.appengine.api import urlfetch

# ...

fetch_headers['X-Shared-Secret'] = '23tqhwfgj2qbherhjerj'
result = urlfetch.fetch(
    url='http://127.0.0.1:1483/fetch_url.php?q=%s' % quote(url),
    payload=data,
    method=urlfetch.POST if data else urlfetch.GET,
    headers=fetch_headers,
    deadline=timeout,
    follow_redirects=False)
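
To keep call sites tidy, the snippet above can be wrapped in a small helper. A sketch; PROXY_URL, SHARED_SECRET and proxied_fetch are hypothetical names, not part of the original code:

from urllib import quote
from google.appengine.api import urlfetch

PROXY_URL = 'http://127.0.0.1:1483/fetch_url.php'  # your VPS endpoint
SHARED_SECRET = '23tqhwfgj2qbherhjerj'

def proxied_fetch(url, data=None, headers=None, timeout=30):
    # add the shared secret without mutating the caller's dict
    headers = dict(headers or {})
    headers['X-Shared-Secret'] = SHARED_SECRET
    return urlfetch.fetch(
        url='%s?q=%s' % (PROXY_URL, quote(url)),
        payload=data,
        method=urlfetch.POST if data else urlfetch.GET,
        headers=headers,
        deadline=timeout,
        follow_redirects=False)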


A little more discussion about the code:

  • A custom header was chosen for authorization because the forwarding of real authentication data (i.e. the “Authorization” header) was needed (specifically, this is what the Twitter API uses for verifying identity)
  • Speaking of things the script forwards: it does forward the user agent and any POST data, if present. Forwarding of other headers could quite easily be added.
  • Passing variables in a GET request is also supported (they will be double-encoded, but that shouldn’t be a concern in all but the most extreme cases; see the illustration below)
  • If we are talking about sensitive data, cURL (and the cURL extension for PHP) has the ability to fetch HTTPS content and to verify the certificates.
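
To make the double-encoding point concrete, here is a small illustration (example.com is a placeholder):

from urllib import quote

inner = 'http://example.com/api?tag=caf%C3%A9'
outer = 'http://127.0.0.1:1483/fetch_url.php?q=%s' % quote(inner)
# quote() re-encodes the already-encoded %C3%A9 as %25C3%25A9;
# PHP then decodes $_GET['q'] exactly once, so the proxy fetches
# the inner URL byte-for-byte as it was written here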

While this method might look cumbersome, in practice I found it to work quite well. Hopefully this information will help others facing the same problem. A final note: if you have questions about the code or about other aspects, post them in the comments; I will try to answer them as quickly as possible. I’m also considering launching a “proxy service” for GAE apps to make this process much simpler (abstracting away the setup and administration of the VPS), so if you would be interested in paying a couple of bucks for such a service, please contact me either directly or through the comments.

