Tracking web users


Again, this will be something new here (at least for me): I’ll publish a pre-rant for Security Now! Steve Gibson expressed interest in the subject of cookies, so I’ll tackle that in this post and also the more general question of user-tracking. I discuss different ways it can be accomplished, ways you could protect yourself and the question: should you?

In a way the World Wide Web is a marketing companies wet dream: just image, tracking the moves of the users, building a profile which lists their potential interests (as it can be inferred from the list of visited sites and the frequency of the visits). Using this they can show ads which they consider will be relevant to us. Of course they don’t do this out of the goodness of their hard. They do it because you have a higher probability of reacting to the advertisement if it’s relevant to you.

Here are the means I know of which can be used to accomplish this:

  • Tracking cookies or third party cookies – this is IMHO a bad name (from a technical point of view), and I’ll explain in a minute why. But first lets answer the question: what are cookies? Cookies (or HTTP State Management Mechanism as it is referred to by the official RFC) are opaque tokens (from the point of view of the client) which contain some information which helps the server side application identify the fact that different HTTP requests are part of the same session. This is necessary, since the HTTP protocol does not define any method for creating, tracking and destroying sessions. That is, whenever you request an object from the web server it will treat it as separate request, having no idea what you requested earlier. The cookie is used as token in the following way: the server says to the client take this piece of information and return it to me on subsequent requests. This way it can determine if the request is part of the same session (because it can hand out a different value to each client and when the client returns the information, it can identify the session it is part of). Before you ask: you can’t use IP addresses as a reliable unique identifier because of proxies and NATs. You can observe two things here: this behavior is entirely voluntary on the clients part (it may choose not to return the token) and that it applies to every HTTP transaction, not just HTML documents (including images, flash animation, java applets, etc). Of course the standard defines a policy which specifies in which requests should the cookie be returned. The elevator speech version of this is: cookies will only be sent back to requests targeted at the server it was originally sent from and to elements the path of which is prefixed by the path contained in the cookie (for example if the cookie was set by the object located at http://example.com/set/a/cookie it will be sent in all requests which are targeted at the example.com server and contain in the url /set/a/cookie). Now how is this used to track you from site to site if the cookie is only returned to the server it was originally sent from? Enter the advertisement companies: they serve up ads from the same server to many webpages. This means that those webpages contain links to elements (usually images, flash animation or javascript) which reside on the server of the advertiser. This means that if you view a page which contains advert from a given company, it can set a cookie, which will later be sent back to it when you view an other page (possibly from an other server) which contains advert from the same company (because in both cases the object – image, flash, whathever – came from the same source the cookie was set – the server of the advertiser). This is called a third party cookie because it is set by a different entity than the server you see in your address bar. However I think that this is a bad name since it implies that some kind of spoofing is going on, like a server is setting a cookie for an other server – which by the way is explicitly prohibited by the standard and won’t work in any modern browser. To sum up:
    • Applicability: (almost) every browser supports it. The standard itself if relatively old (almost 10 years)
    • Customizability: Current browsers offer ways to set a policy on what cookies should / should not be accepted both in a whitelist and blacklist format. Usually they do not include the option to view the cookies stored on the machine, but there are many free third party tools / extensions which enable you to do this.
    • Risk of disabling it: if cookies are disable altogether, many sites which have a member-only area will break and the user will be unable to log-in. Disabling of third party cookies breaks pages which host elements fetched from a third party server (which represents a small but growing percentage of the web in the age of mashups)
  • Flash Local Shared Objects (AKA flash cookies) – As of version six (also called Flash MX) a feature was introduced in the Flash Player to store information which had to preserved across different page loads locally on the users computer. Before that sites used a combination of javascript, cookies and actionscript to obtain the same effect. Flash Local Shared Objects have the same restrictions as cookies for forwarding (i.e. they’re only sent to flash movies which originate from the same server). Because this was a little known feature outside of the Flash developer community and the interface was hidden and because of the scaremongering many users started to remove or disable cookies, advertisers started to use it instead of cookies.
    • Applicability: on any platform which has at least version 6 of the Flash Player installed.
    • Customizability: you can go to the site of Adobe to completely disable or to manage the shared objects which are on your computer. There is also a Firefox extension, however it seems dated and not maintained any more, so probably the safest bet is to go with the official links provided above.
    • Risk of disabling it: sites which rely on it may break, however I didn’t found any sites until now which relied on it for other purposes than tracking, so currently it may be disabled without any problems. This may change in the future however.
  • Referrer URLs – Referrer URLs is a piece of information sent by your browser when requesting an object from a web server. For example if you click a link at http://foo.com/link.htm which takes you to http://bar.com/target.htm, the bar.com webserver will receive as part of the request (if you didn’t disable it in your browser) the string http://foo.com/link.htm as the referrer. This can (and is) used by sites for statistical purposes (to see who links to them) and for security (however this is a pretty weak form of security since it relies on the client playing it straight and thus it can be spoofed. One thing which makes the privacy advocates suggest to turn this feature off is the fact that if you go to a page from a search engine (that is, you searched for bar.com on google and then clicked on one of the results), the target server can know the words you searched for (since it will be embedded in the referrer url). However, this information isn’t forwarded to the advertisers unless the use third party javascript to get it (which I’ll talk about later on). That is if you go: Google -> Google search results -> foo.com -> (automatically, because it is embedded in the page at foo.com) advertiser. The referrer transmitted at the last step (that is from foo.com to the advertiser) if foo.com (meaning that the only information that the advertiser gets is the fact that the ad was loaded from foo.com, not the way by which the user arrived to foo.com. I want to stress this because Steve Gibson got this wrong on episode 64 of the Security Now podcast. (I want to stress again that advertisers can get the referrer of the page which includes the advertisement by using third party javascript which I’ll talk about shortly).
    • Applicability: on almost every browser
    • Customizability: you can see a tutorial about enabling it here which should point you in the right direction.
    • Risk of disabling it: you shouldn’t encounter any problems because few sites use it for other purposes than statistics, but if you don’t mind, give them this piece of information, it can be used to create better content for you!
  • Third party javascript – usually when a site collaborates with a given advertiser, it is asked to put a piece of HTML in every page where s/he want the ads to be displayed. This code is usually an IFRAME tag or a SCRIPT tag. In the later case we talk about third party scripts – javascript code which is provided by a third party and runs in the context of the current page. This code can do almost everything, including the following things: access the referrer of the current page (so even if it isn’t directly relied to the advertisement server, the script can forward it), get information about the browser capabilities (screen resolution, etc) and perform history digging (see the next point).
    • Applicability: on every browser which understands javascript.
    • Customizability: in Firefox you can use the NoScript extension. In Internet Explorer you can add the sites you want to block scripts from in the Restricted Sites Zone. An other solution would be to disable javascript entirely, but this will reduce the usability of many sites.
    • Risk of disabling it: mashups use heavily third party javascript (to embed Google Maps for example). Also some big sites host their script files on different servers than the content (to be able to optimize the servers for the specific types of files), so you can’t say generally that everything third party is bad.
  • History digging – This is a really cool technique, reported first as far as I can tell by Jeremiah Grossman and was later tweaked to work with IE. It is based on the fact that visited links have different styles than non-visited links (this is usually observed as different colors). If you put a bunch of links on a page and then use javascript to inspect the styles applied to them by the browser, you can tell if the given sites are in the history of the browser.
    • Applicability: there is proof of concept code for Firefox and IE. It should work in any browser which has a standard conformant implementation of javascript and DOM.
    • Customizability: you can’t programatically disable just this feature. Your options are: (a) disabling javascript (b) cleaning your history before you visit sites you suspect are doing this. One important fact: if an advertiser embeds javascript on the site the ad is displayed on, it can use this technique to find out if you visited a given site. Fortunately there is a mitigating factor: in order for somebody to find out if you visited a given page s/he has to know the exact url of the page (that is this method can not be used to enumerate the entries of your history)
  • Sign-in information – an often overlooked fact by people is that the big three identity providers (Google, Yahoo and MSN) also provide advertising. Because of this they can correlate tracking information obtained by any of the methods listed above with the personal information you provided at signup. Now I’m not saying that they do this, I’m just saying that they have the technical means to do it.
    • Applicability: if you are a user of any of these sites and browse sites – while you are logged on – which display advertisement from them, you are affected.
    • Customizability: log off before browsing to other sites and clear all the cookies from them. Before logging back in also clear the cookies from them placed there by the ads.
    • Risk of disabling it: the inconvenience of constantly having to clear cookies.

Now for the philosophical question: should you be worried? Should you go to great length to avoid this tracking, even at the cost of breaking useful features on the site? You should consider the following ideas (they are not absolute truths, but arguments which are used in this debate):

  • Nothing is free and advertisement is an (arguably) quick and (mostly) painless way of payment for the content / service. So disabling advertisement can be thought of as a way of cheating to get what you desire without payment)
  • Contextual ads can be useful. For example if I would like to buy a laptop and I see an ad for laptop, I will most probably click it. This is useful for both parties: for me because possibly I learn about an offer I didn’t know about and for the company who put out the ad, because I might buy something from then.
  • Some people say: but this is not right! The user should be in control! If you want to buy laptops, search for them yourself! Of course no rational person (no offense to anybody) would buy something of significant value based on one ad (because usually it’s only showing one detail of the product – probably not mentioning the not-so-bright sides) but it may add value to your research. So, while you shouldn’t buy based on what they say on the teleshopping channel – err I mean ad 🙂 – it may add value to your research while you are considering your options.
  • The tinfoil hat people may say: I don’t want the government / Amazon / Google / whatever track my every movement! I have a right to privacy! – and they are right, they do have a right to privacy, however they must be willing to give up certain benefits or to make some additional steps. And before you object saying: why do I have to make extra efforts to get the same service everybody receives while keeping my information as private as possible? – just consider how things work in the real world – if you want to drive a car, you must get a license. It is your right to drive a car (if you are of legal age), however you still have to get a license. Because every analogy breaks down, lets consider the technical point of view: every technology can be used for good an bad (this is even more so if there is no clear distinction between good and bad). The only way of preventing 100% of the bad usages of a technology is to ban it all together. You may choose this, but be aware that you are not getting the benefits either. Now some of the technologies (like session cookies) can be emulated by other technologies (like appending the SID – the session identifier to every request as a GET parameter), however the given technology was introduced to make it easier to accomplish certain tasks without the complication and hassle the old method needed. Guess, what a rational website owner / creator would do: use the more complex, less reliable and more expensive technology for a very little percent of its visitors or go with the easier and more powerful technology?

One response to “Tracking web users”

  1. Hi,

    Very nice info indeed, Thanx !

    As requested i do have some questions/thoughts etc, which i’d like answers to.

    Malicious Cookie Exploits !

    I’ve been to establish for some time, with not much help or success, if it’s at all possible for Cookies to manipulated maliciously. In other words ” could ” code, of Any decription, be inserted/injected etc by Any means into a Cookie ? Furthermore, could this then be launched/run etc, either directly and/or indirectly in some way/s ?

    I’m not automatically excluding Anything in this scenario, it might be JavaScript or a mixture of different code, and/or whatever it takes to make it work !

    Here’s a brief selection i’ve found on the subject –

    inserting malicious content into a cookie – http://www.cert.org/tech_tips/malicious_code_mitigation.html

    the cookie may be modified by the attacker to include
    malicious code. – http://www.ciac.org/ciac/bulletins/k-021.shtml

    it is easy for a client to alter their cookie to allow inclusion of malicious content or send bogus information in their HTTP requests – http://www.peej.co.uk/articles/cross-site-scripting.html

    What Are The Chances of Catching a Virus From a Cookie? – http://www.cookiecentral.com/c_virus.htm

    There was another i discovered the other day on i believe Secunia, which was a Vulnerability in either Windows and/or IE, that showed this was not only possible, but Actually has happened ! I thought i’d be able to locate it easily again, but couldn’t when i tried today ? I’ll try and find it again if i can.

    Thanx in advance for Any light/info/links etc you can shed on this.

    Regards,

    Spanner

    SpannerITWks

Leave a Reply to Anonymous Cancel reply

Your email address will not be published. Required fields are marked *