Tuesday, August 10, 2010

Can we count users without uniquely identifying them?

Aaaah
Hi all. I'm just back from a rather nice holiday. Well, technically, I'm still on holiday, but there were a few things I wanted to take care of, so I popped in for a few hours of work yesterday and today. I saw this post on Phoronix, and it prompted me to write a post that I've been meaning to write for the last few weeks, since the Canonical Platform Team got together in Prague three weeks ago, to be exact.

Pre-installed desktops ftw
One of the roles of Canonical relative to Ubuntu is to get Ubuntu pre-installed on as many computers as possible. This is one of the dreams of the Linux desktop. Pre-installs mean end users don't have to fiddle with configuration, install drivers, and so on (at least when done well), and users can make an apples-to-apples comparison between their free desktop and the proprietary systems that normally come pre-installed.

Canonical does this by working with OEM customers. OEMs are companies that sell assembled computers to people. One of these customers asked Canonical if there was some way to know how many of the computers they ship with Ubuntu keep Ubuntu installed. The customer's engineer came up with a system where they would create a unique identifier for each Ubuntu computer they sold, and then each computer would send that unique identifier along with its daily request for update info.

The customer didn't really want to use a unique identifier though. Even though the identifier was anonymous, the customer only wanted to *count* computers, and unique identifiers are for *tracking* (following a user over time). We mulled it over and over, and finally, based on our experience with web browsers, we hit upon a system of non-unique channel identifiers to do the counting. This makes tracking impossible, but of course tracking is not the goal; counting is.

Non-unique channel identifiers
So, we flashed on this: if each install sent just the model name and the number of times it has updated, systems could be counted, but no unique data would ever be sent to the server. Now, I am not a mathematician, so each time I try to explain why I think this works, it takes me a while. But in the end, everyone is convinced. In fact, Matt Zimmerman ended up writing a test program to prove to himself that it worked. Let me try, stick with me here ...

Every day each computer from the customer sends its model name and the number of times it has already sent this data to the server. So if a model of computer is called, say, "foo", the first day it sends "foo" and 0 to census.canonical.com. After sending the 0, the computer remembers that it already sent a 0, so it will send a 1 next time. When the server sees the foo.0 in the log data, it essentially starts a new counter for the model foo. The total number of foo.0 entries is the total number of computers of model foo ever activated.
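To make the client side concrete, here is a minimal sketch in Python. It is only an illustration of the idea described above, not the code that actually runs on these machines: the function name `next_ping`, the `state` dictionary standing in for whatever the computer stores locally, and the "model.count" string format (borrowed from the foo.0 notation above) are all assumptions.

```python
def next_ping(model, state):
    """Build the next census message and remember that it was sent.

    `state` is a hypothetical stand-in for local storage on the machine;
    it only ever holds the number of pings sent so far.
    """
    count = state.get("count", 0)   # 0 on the very first ping
    ping = f"{model}.{count}"       # e.g. "foo.0": nothing unique to this machine
    state["count"] = count + 1      # next time the machine will send count + 1
    return ping


# A brand new machine of model "foo" pinging on two different days.
state = {}
first = next_ping("foo", state)
second = next_ping("foo", state)
```

Note that the only thing the machine keeps is a small integer, and every other foo machine that has pinged the same number of times sends exactly the same message.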

Take one of those foo computers. The next day it will send foo.1, saying "this is a computer of model foo, and this is the 2nd time it has pinged that it's alive". Notice that neither foo nor the number 1 is unique data. Any number of computers will be reporting the exact same model name and increment number. When the server sees a 1 come in, it finds the first counter for foo that is still at 0 and increments that counter to 1. Now it knows the total number of computers ever activated (all the counters), and it can count all the counters that were incremented in a day and thereby know how many computers were online that day.
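The server-side bookkeeping can be sketched in a few lines of Python as well. Again, this is a rough illustration under my own assumptions, not the real census.canonical.com code: the `Census` class, its method names, and the `process_day` helper are hypothetical. Since the counters are anonymous, the sketch does not keep them individually; it just tracks how many counters sit at each value for each model.

```python
from collections import Counter, defaultdict


class Census:
    def __init__(self):
        # model name -> {counter value: how many counters are at that value}
        self.counters = defaultdict(Counter)

    def record(self, model, count):
        bucket = self.counters[model]
        if count == 0:
            bucket[0] += 1              # first ping ever: start a new counter
            return
        if bucket[count - 1] <= 0:
            raise ValueError(f"no {model} counter at {count - 1} to increment")
        bucket[count - 1] -= 1          # take any one counter at count - 1 ...
        bucket[count] += 1              # ... and move it up to count

    def total_activated(self, model):
        # One counter exists per machine that ever sent a 0.
        return sum(self.counters[model].values())


def process_day(census, pings):
    """Record one day of (model, count) pings; return machines seen per model."""
    active = Counter()
    for model, count in pings:
        census.record(model, count)
        active[model] += 1
    return active


census = Census()
day1 = process_day(census, [("foo", 0), ("foo", 0), ("foo", 0)])
# Day 2: two of the machines come back, one stays off, one new machine appears.
day2 = process_day(census, [("foo", 1), ("foo", 1), ("foo", 0)])
```

In this made-up run, the server ends up with four foo counters (four machines ever activated) and saw three pings on the second day, without ever learning which machine was which.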

Future?
Currently this system is only slated to be used by the specific OEM customer who requested it, and it will be up to the customer to disclose the data they collect as they wish. I wonder if it would be a good thing to install on normal ISOs though, but this would be part of our normal participatory community decision-making process. Projects like this make me think that users would like to be counted, so long as they can't be tracked. We'll see how it plays out. It may be something to discuss at UDS if the community feels the data would be useful.