These days we should write all code as if it will be used by people from all over the world, with a wide variety of personal information (just look at Falsehoods Programmers Believe About Names for some headscratchers). I would like to make a small contribution to this by showing how UTF-8 encoded strings can be passed as GET/POST parameters.
For this I’ll be using the following small PHP script, which can be served quickly with the command-line web server built into PHP since version 5.4:
<?php header('Content-Type: text/html; charset=utf-8'); ?>
<code><pre>
GETs:
<?php print_r($_GET); ?>
POSTs:
<?php print_r($_POST); ?>
</pre></code>
We’ll test this with the following Python script:
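#!/usr/bin/python
# vim: set fileencoding=utf-8 :
import urllib
import urllib2

params = {'name': u'東京'}
# Step 1: encode every unicode value to UTF-8 bytes.
params = {k: v.encode('utf-8') for k, v in params.iteritems()}
# Step 2: percent-encode the UTF-8 bytes.
data = urllib.urlencode(params)

# Send the same parameters in the query string (GET) and in the body (POST).
url = 'http://localhost:8000/?' + data
req = urllib2.Request(url, data)
response = urllib2.urlopen(req)
print response.read()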
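The script above is written for Python 2 (urllib2, iteritems, the print statement). For readers on Python 3, here is a rough sketch of the same request. It assumes the PHP test server is listening on localhost:8000; the helper names build_request and send are made up for this illustration.

```python
#!/usr/bin/env python3
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# Assumption: the PHP test script is being served on this address.
BASE_URL = 'http://localhost:8000/'


def build_request(params, base_url=BASE_URL):
    """Build a request carrying params in both the query string and the POST body."""
    # In Python 3, urlencode() encodes str values as UTF-8 by default
    # and then percent-encodes the resulting bytes.
    query = urlencode(params)
    # The percent-encoded string is plain ASCII; the request body must be bytes.
    body = query.encode('ascii')
    return Request(base_url + '?' + query, data=body)


def send(req):
    """Send the request and return the response body decoded as UTF-8."""
    with urlopen(req) as response:
        return response.read().decode('utf-8')


# With the server running, the round trip would be:
#     print(send(build_request({'name': '東京'})))
```

Because a body is attached, urllib sends this as a POST, so the PHP script should see the value in both $_GET and $_POST.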
This all works as expected, so here are some conclusions:
- GET and POST variables need to be UTF-8 encoded first and then URL-encoded (percent-encoded). See this StackOverflow answer.
- Based on the same answer: hostnames use Punycode instead (but we are not concerned with hostnames here)
- You might need to add the following header for POST requests to work: “Content-Type: application/x-www-form-urlencoded; charset=UTF-8”
- Skipping the UTF-8 encoding step and passing a unicode string with non-ASCII characters straight to urllib.urlencode raises a UnicodeEncodeError
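To make the encoding sequence from the list concrete, here is a small Python 3 sketch that performs the two steps by hand, reverses them, and attaches the Content-Type header mentioned above. Nothing is sent over the network; the URL is only the assumed address of the test server. Note that in Python 3, urllib.parse.urlencode does the UTF-8 step itself by default, so the explicit version here is for illustration only.

```python
from urllib.parse import quote, unquote_to_bytes
from urllib.request import Request

name = '東京'

# Step 1: encode the text to UTF-8 bytes.
utf8_bytes = name.encode('utf-8')
# Step 2: percent-encode those bytes.
encoded = quote(utf8_bytes)

# Reversing the sequence recovers the original text.
decoded = unquote_to_bytes(encoded).decode('utf-8')

# Build (but do not send) a POST request with an explicit Content-Type.
# Assumption: localhost:8000 is where the PHP test script is served.
body = ('name=' + encoded).encode('ascii')
req = Request(
    'http://localhost:8000/',
    data=body,
    headers={'Content-Type': 'application/x-www-form-urlencoded; charset=UTF-8'},
)

print(encoded)
print(decoded)
print(req.get_header('Content-type'))
```

The header lookup uses 'Content-type' because urllib.request.Request stores header names in capitalized form.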