It is no secret that the Web experience has evolved to include a great deal of tracking by various (third-party) sites to classify each visitor. The most obvious reason for tracking and classification is to target ads to the appropriate audience. However, it is also apparent that tracking can be used by government agencies to detect perceived threats and by criminals and hackers to steal, stalk, and create havoc.
The bottom line is that if dozens or even hundreds of unknown sites are tracking you in the shadows of the Internet, then you have lost control of your privacy and maybe your security, too.
The extent of identification depends on the information you share. It's indeed possible that a third-party site can know a great deal about you in addition to the sites you visit and what you do while there. And sometimes that third party was at one point the primary site (e.g., google.com).
In this case, a site can build an extensive classification of you including your background, associates, habits, likes & dislikes, and maybe even your intentions.
In response to the state of the Internet and its condition of pervasive undesirable tracking, a typical modern browser includes an option to instruct sites to not track its particular user. If the Do Not Track (DNT) feature is enabled, each HTTP web request sent via the browser includes a field that directs the web server to not track its user.
A typical exchange of a request and response that includes the DNT option is shown below (note that content removed is denoted with an ellipsis '...').
An example Request from the browser:
GET / HTTP/1.1
Host: www.whitehouse.gov
User-Agent: Mozilla/5.0 ... Firefox/29.0
...
DNT:1
...

An example Response from the server:

HTTP/1.0 200 OK
Date: Wed, 04 Jun 2014 12:40:23 GMT
Server: WSGIServer/0.1 Python/2.7.3
...
Set-Cookie: webcookie=randomstring; expires=Wed, 03-Jun-2015 12:40:23 GMT; Max-Age=31449600; Path=/

<!DOCTYPE html>
<html>
<head>
...
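To make the mechanics concrete, here is a minimal Python sketch of how a server that chose to honor the preference might detect it. The function name `dnt_enabled` and the sample request are illustrative assumptions, not part of any real server; the point is only that DNT is an ordinary request header that the server is free to read or ignore.

```python
def dnt_enabled(raw_request: str) -> bool:
    """Return True if a raw HTTP request carries a 'DNT: 1' header.

    Hypothetical helper for illustration only.
    """
    # Header lines follow the request line and end at the first blank line.
    lines = raw_request.replace("\r\n", "\n").split("\n")
    for line in lines[1:]:
        if line == "":
            break  # end of the header section
        name, sep, value = line.partition(":")
        # Header names are case-insensitive; spaces around the value are optional.
        if sep and name.strip().lower() == "dnt":
            return value.strip() == "1"
    return False


# A shortened request modeled on the example exchange.
request = (
    "GET / HTTP/1.1\r\n"
    "Host: www.whitehouse.gov\r\n"
    "DNT:1\r\n"
    "\r\n"
)
print(dnt_enabled(request))  # True
```

Nothing in the protocol forces the server to act on the result, which is why the header alone offers no guarantee.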
DNT may seem like a good solution, but testing shows it's about as effective as wearing a "don't rob me" sign while counting cash in public.
Third Party Testing
Our testing consists of the following:
- Identify the extent of sites tracking us by using the Firefox Lightbeam plugin as we visit seven sites in a particular (repeatable) order.
- First visit the seven sites with Don't Track (DNT) disabled, then visit them again with DNT enabled, and record and compare the results. Use both the Lightbeam data and our local Firefox cookie database to determine the impact of the Don't Track option.
- If the Don't Track option proves NOT to be effective, then use a proxy and content filter to block requests to third-party tracking sites.
The seven sites we choose to visit along with tracking results are shown below in Tables 1 and 2. Table 1 summarizes the data reported by Lightbeam, and Table 2 summarizes the number of third-party cookies set in our browser. Each column indicates the number of third parties detected as an independent result and also the cumulative result: independent/cumulative.
For example, visiting weather.com independently results in 43 third-party sites being detected. Visiting wsj.com independently results in connecting with 75 third parties. However, only 90 third-party sites in total are reported after visiting weather.com and then wsj.com. The difference between the sum of the independent results and the cumulative result (43+75-90) is correlated with the extent of third-party tracking: 28 sites are shared between the two domains, and many of them are third-party tracking sites.
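The arithmetic above is an instance of inclusion-exclusion: the overlap of two sets equals the sum of their sizes minus the size of their union. A small Python sketch follows; the function name `shared_count` and the `.example` domain names are placeholders chosen for illustration, not data from the study.

```python
def shared_count(first_alone: int, second_alone: int, cumulative: int) -> int:
    """Inclusion-exclusion: |A and B| = |A| + |B| - |A or B|."""
    return first_alone + second_alone - cumulative


# Lightbeam counts quoted in the text: weather.com, wsj.com, then both together.
print(shared_count(43, 75, 90))  # 28

# The same idea with explicit sets of (hypothetical) third-party domains.
site_a = {"ads.example", "metrics.example", "cdn.example"}
site_b = {"ads.example", "metrics.example", "video.example"}
shared = site_a & site_b
assert len(shared) == shared_count(len(site_a), len(site_b), len(site_a | site_b))
print(sorted(shared))  # ['ads.example', 'metrics.example']
```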
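Table 1: Third-party sites reported by Lightbeam (independent/cumulative)
|Web site||No Preference||Don't Track|
Table 2: Third-party cookies set in the browser (independent/cumulative)
|Web site||No Preference||Don't Track|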
Also, for this particular example (wsj.com + weather.com), the Lightbeam graph is provided below and shows how third-party sites track users across both weather.com and wsj.com. Lightbeam is an amazing tool and conveys its data in both list and graph format. In this graph depiction, circles are the primary sites we visit (e.g., wsj.com), and triangles are the third-party sites that we unintentionally visit. Note that not all third-party sites are shown in the graphical display.
Returning to Tables 1 and 2, we can see that setting the "Don't Track" preference had virtually no impact on the cumulative number of third-party sites tracking us or on the number of third-party cookies being set. For all we know, third-party sites are even using the "Don't Track" preference to classify us further.
The bottom line is that after visiting seven sites, we have nearly 150 third-party sites potentially tracking us along with 250 third-party cookies regardless of our Don't Track preference.
To further illustrate the results, Figures 2 and 3 below show the Lightbeam graphs at the end of visiting our seven sites with and without the "Don't Track" preference set.
And before we go on to discuss cookies and then filtering & blocking at the network level, it's important to point out a few things:
- Not all cookies are bad. They enable your content provider to do useful things like remember your state and that you have logged in (authenticated yourself).
- The results provided here should be repeatable in general, but repeated trials are extremely unlikely to yield exactly the same number of tracking sites or cookies. It appears that most sites and/or ad engines change their particular ads frequently.
- Both weather.com and maximintegrated.com repeatedly exhibited a lower number of third-party connections and cookies with Don't Track set.
- Some sites can exhibit wild swings from visit to visit. For example, visiting maxim.com triggered over 100 third-party connections during one independent visit.
- An independent visit implies that the browser's history was cleared of everything and the Lightbeam data was reset before visiting the site and collecting the data.
- There are other ways for a web site to track its users, including tracking the IP address of the visitor and a signature conveyed in request headers (e.g., User-Agent).
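To illustrate the last point, the sketch below shows how a site could, in principle, derive a stable identifier from nothing more than the visitor's IP address and User-Agent header, with no cookie involved. This is a deliberately simplified assumption about how such a signature might be built; the function name `visitor_fingerprint`, the IP address, and the User-Agent strings are all made up for the example.

```python
import hashlib


def visitor_fingerprint(ip_address: str, user_agent: str) -> str:
    """Derive a short, repeatable identifier from request metadata.

    Hypothetical illustration: real trackers may combine many more signals.
    """
    material = f"{ip_address}|{user_agent}".encode("utf-8")
    # A truncated SHA-256 digest is enough to show the idea.
    return hashlib.sha256(material).hexdigest()[:16]


# 203.0.113.7 is a documentation-only address; the User-Agent is abbreviated.
first_visit = visitor_fingerprint("203.0.113.7", "Mozilla/5.0 Firefox/29.0")
second_visit = visitor_fingerprint("203.0.113.7", "Mozilla/5.0 Firefox/29.0")
other_browser = visitor_fingerprint("203.0.113.7", "Mozilla/5.0 Chrome/35.0")

print(first_visit == second_visit)   # True: same visitor, same identifier
print(first_visit == other_browser)  # False: a different browser signature
```

Because the identifier is computed on the server side, clearing cookies in the browser does not change it.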
Third Party Cookies
The previous section mainly focused on Lightbeam data. This section focuses on the cookies themselves. For convenience, we again show Table 2.
|Web site||No Preference||Don't Track|
The first number in each column is the number of third-party cookies set for an independent visit, and the second number is the cumulative number of third-party cookies set: [ independent / cumulative ]. The extent of tracking is correlated with the difference between the sum of the independent results and the cumulative result. For weather.com and wsj.com, the difference is 139+38-155=22. In other words, 22 cookies have been set that are shared between the two sites (with no tracking preference set).
The overall result is that regardless of our Don't Track preference, at the end of visiting our seven sites, we have over 250 third-party cookies set in our browser.
There are multiple ways to determine which cookies third parties set in the browser. For our study, we have chosen to work directly with the Mozilla cookie database: the moz_cookies table in the cookies.sqlite file. This file can be found in the profile folder under ~/.mozilla/firefox on Linux and ~/Library/Application\ Support/Firefox/ on macOS. The file can be read using the sqlite3 command-line tool.
Before accessing the cookies.sqlite file, we quit Firefox and copy the cookies.sqlite file to our local working directory. The example below shows how to write out the third-party cookies to a csv file after visiting weather.com and wsj.com.
$ sqlite3 cookies.sqlite
SQLite version 3.7.9 2011-11-01 00:52:41
...
sqlite> .mode csv
sqlite> .output 2_t.csv
sqlite> select * from moz_cookies where baseDomain!="weather.com" and baseDomain!="wsj.com";
If we had only wanted to display the number of third-party cookies in the moz_cookies table, then we could have replaced select * with select count(*):
$ sqlite3 2_t.sqlite
SQLite version 3.7.9 2011-11-01 00:52:41
...
sqlite> select count(*) from moz_cookies where baseDomain!="weather.com" and baseDomain!="wsj.com";
155
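The same count can be scripted, which is handy when the measurement is repeated for several site combinations. The following Python sketch uses the standard `sqlite3` module; the function name `count_third_party_cookies` is our own, and the demo runs against a tiny in-memory table with made-up rows (only the `baseDomain` column matters here) rather than a real cookies.sqlite file.

```python
import sqlite3


def count_third_party_cookies(conn, first_party_domains):
    """Count cookies whose baseDomain is not one of the sites we visited."""
    # One "?" placeholder per visited site keeps the query parameterized.
    placeholders = ",".join("?" for _ in first_party_domains)
    query = (
        "SELECT COUNT(*) FROM moz_cookies "
        f"WHERE baseDomain NOT IN ({placeholders})"
    )
    (count,) = conn.execute(query, list(first_party_domains)).fetchone()
    return count


# Demo data: a stand-in for moz_cookies with hypothetical third-party domains.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE moz_cookies (id INTEGER PRIMARY KEY, baseDomain TEXT, name TEXT)"
)
conn.executemany(
    "INSERT INTO moz_cookies (baseDomain, name) VALUES (?, ?)",
    [
        ("weather.com", "session"),
        ("wsj.com", "session"),
        ("tracker-one.example", "uid"),
        ("tracker-one.example", "seg"),
        ("tracker-two.example", "uid"),
    ],
)
print(count_third_party_cookies(conn, ["weather.com", "wsj.com"]))  # 3
```

To run it against real data, one would pass `sqlite3.connect("cookies.sqlite")` using a copy of the file made while Firefox is closed.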
The description (schema) of moz_cookies is provided below. For our study, we're mainly interested in the baseDomain column.
.schema moz_cookies
CREATE TABLE moz_cookies (
  id INTEGER PRIMARY KEY,
  baseDomain TEXT,
  appId INTEGER DEFAULT 0,
  inBrowserElement INTEGER DEFAULT 0,
  name TEXT,
  value TEXT,
  host TEXT,
  path TEXT,
  expiry INTEGER,
  lastAccessed INTEGER,
  creationTime INTEGER,
  isSecure INTEGER,
  isHttpOnly INTEGER,
  CONSTRAINT moz_uniqueid UNIQUE (name, host, path, appId, inBrowserElement)
);
CREATE INDEX moz_basedomain ON moz_cookies (baseDomain, appId, inBrowserElement);
Below we process our cumulative Don't Track database and output a file of distinct third-party domains.
sqlite> .output tracking_cookies1.txt
sqlite> select distinct baseDomain from moz_cookies where baseDomain!="wsj.com"
   ...> and baseDomain!="weather.com" and baseDomain!="cnn.com" and baseDomain!="maxim.com"
   ...> and baseDomain!="maximintegrated.com" and baseDomain!="disney.com" and baseDomain!="foxnews.com";
The "tracking_cookies1.txt" file is provided below. Keep in mind that this is a list of each distinct domain that set at least one cookie in our browser while visiting our seven sites. There are 94 domains listed, and the majority of these third-party sites set multiple cookies in our browser.
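Since most of these domains set more than one cookie, it can be useful to list how many cookies each third party set. The Python sketch below groups the rows of moz_cookies by `baseDomain`; the function name `cookies_per_third_party` and the sample rows are illustrative assumptions, not results from the study.

```python
import sqlite3


def cookies_per_third_party(conn, first_party_domains):
    """Return (baseDomain, cookie_count) pairs for third parties, busiest first."""
    placeholders = ",".join("?" for _ in first_party_domains)
    query = (
        "SELECT baseDomain, COUNT(*) AS n FROM moz_cookies "
        f"WHERE baseDomain NOT IN ({placeholders}) "
        "GROUP BY baseDomain ORDER BY n DESC, baseDomain"
    )
    return conn.execute(query, list(first_party_domains)).fetchall()


# Demo data: hypothetical domains standing in for a real cookies.sqlite copy.
demo = sqlite3.connect(":memory:")
demo.execute(
    "CREATE TABLE moz_cookies (id INTEGER PRIMARY KEY, baseDomain TEXT, name TEXT)"
)
demo.executemany(
    "INSERT INTO moz_cookies (baseDomain, name) VALUES (?, ?)",
    [
        ("wsj.com", "session"),
        ("tracker-one.example", "uid"),
        ("tracker-one.example", "seg"),
        ("tracker-two.example", "uid"),
    ],
)
for domain, count in cookies_per_third_party(demo, ["wsj.com"]):
    print(domain, count)
# tracker-one.example 2
# tracker-two.example 1
```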
At this point you may be in disbelief that 94 sites that you unintentionally visited set (tracking) cookies in your browser after visiting only seven sites with Don't Track set.
There are solutions to this problem. You can choose to disable cookies entirely; however, you probably won't like the result. In some browsers, including Firefox, you can set a preference to not allow third parties to set cookies. However, keep in mind that these third-party sites can still track you by your IP address, and configuring each browser individually isn't a great solution when you frequently surf the Web from multiple machines and browsers.
Content Filtering to Counter Third Party Tracking
So far, we have shown that extensive third-party tracking occurs as we travel across the Internet regardless of how we set the Don't Track preference in our browser. Now armed with a list of domains that potentially act as third-party tracking sites, we turn to a content filter within our network to block third parties from tracking us.
The data flow is depicted in Figure 4. Our local network controller blocks undesired web requests utilizing its database and instructs the local web server to respond with a blocking message rather than forwarding the request onto the Internet. This prevents the third-party site from receiving the request and subsequently setting third-party cookies in our browser.
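The core decision in such a filter is whether a requested host belongs to a blocked domain. The Python sketch below shows one plausible way to make that decision by checking the host and each of its parent domains against a blocklist. It is not the network controller's actual logic, which the article does not describe; the function name `is_blocked` and the `.example` domains are assumptions for illustration.

```python
def is_blocked(host: str, blocked_domains: set) -> bool:
    """Return True if the host or any parent domain is on the blocklist."""
    labels = host.lower().rstrip(".").split(".")
    # For "a.b.example" check "a.b.example", then "b.example", then "example".
    for i in range(len(labels)):
        if ".".join(labels[i:]) in blocked_domains:
            return True
    return False


# Hypothetical blocklist built from a list of third-party tracking domains.
blocklist = {"tracker-one.example", "ads.tracker-two.example"}

print(is_blocked("pixel.tracker-one.example", blocklist))  # True: subdomain of a blocked domain
print(is_blocked("www.wsj.com", blocklist))                # False: not listed
print(is_blocked("tracker-two.example", blocklist))        # False: only its "ads" subdomain is listed
```

Matching on whole labels rather than on substrings avoids blocking an unrelated host whose name merely ends with the same characters.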
At this point, we will discontinue using Lightbeam to gather data and just rely on analyzing the cookie database. This is because Lightbeam won't adequately discern our local web server's response from a true third-party response.
For this test, we disable the Don't Track preference in Firefox.
The results using our content filtering database are shown below. We now have 27 third-party sites setting cookies in our browser. Note that a few of them are probably desirable (e.g., turner.com and dowjoneson.com).
As a final step, we will add the majority of this list to a new branch in our content filter database tree and re-visit the seven sites one final time.
The results are shown below. Only eight third parties are now setting cookies in our browser. Remember that we initially had 94 third-party sites setting cookies in our browser.
Conclusions
- Testing shows that third-party tracking is indeed pervasive, and the Don't Track preference appears to show little to no positive effect.
- Centralized content filtering is an effective way to either drastically reduce or eliminate third-party tracking. However, its success depends on maintaining a current database of third-party tracking sites.
- Both the Lightbeam plugin and the cookie database are great collections of data to identify the third-party sites performing tracking.
- Blocking third-party tracking sites does not negatively impact the browsing experience. In fact, it may be more pleasurable since ads are often distracting. Figure 5 shows a screenshot of wsj.com with centralized content filtering enabled.
- Blocking requests to third-party sites can yield a more responsive experience since the browser doesn't need to make the dozens (if not hundreds) of additional third-party web requests for each page visited.