You've Been De-Anonymized

The supposed "anonymity" of users in behavioral tracking systems has been challenged repeatedly by privacy advocates over the years. One way or another, even when a non-personal identifier is assigned to a user, it is possible (possible, mind you) to trace back at least to that person's IP address, and perhaps to sensitive PII. In a new academic paper, two University of Texas computer scientists, Arvind Narayanan and Vitaly Shmatikov, turn their attention to defining "anonymity" and "privacy" online.


"De-anonymizing Social Networks" is a turgid read to be sure, and our eyes glaze over when they start throwing in equations. But this is the sort of academic scrutiny of privacy that is getting funded (partially by National Science Foundation grants in this case) and will form a bedrock of research about the technical possibilities around identity protection and theft. The outlines of this research are worth reviewing because they reveal the contours of what privacy protection policies will have to address in dealing with social networks.



The basic approach taken by these University of Texas researchers involves overlaying anonymized data from multiple social networks in a way that ultimately reveals identity. The paper does not argue that PII is available directly to third parties, advertisers or malicious hackers, but that anonymized social graphs that are available can be combined to render identities.

The researchers claim that anonymized social graphs are available from many social networks to academic and government data miners, advertisers, third-party application developers and network aggregators. For instance, the paper cites a few instances where widget makers and application developers exposed sensitive information about users on the network. Whether it is through available APIs into a social network or just profile scraping, "it is important to understand what a malicious third party application can learn about members of a social network even if it obtains the data in an anonymous form," the researchers argue.

The aim of the research was to determine whether sensitive information about individuals could be extracted from "anonymous" social graphs. The researchers used social graph data from Twitter and Flickr, whose APIs they say include mandatory username fields and optional name and location fields. Also included in the research was a graph of friend relationships from the LiveJournal blogging service. They used a series of algorithms (which make our head hurt just thinking about them) to find common people across two networks, and then mapped outward from those matches through the social graphs to find common cliques.
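For readers who want a feel for the trick without the equations: the core idea is to start from a handful of known matches ("seeds") between the anonymized graph and a public one, then repeatedly match nodes whose already-identified friends line up. The toy Python sketch below is our own simplification, not the paper's actual algorithm (which adds degree information, eccentricity thresholds, reverse matching and more); the graphs and names here are invented for illustration.

```python
# Toy seed-and-propagate re-identification sketch (illustrative only,
# NOT the Narayanan-Shmatikov algorithm; graphs and names are made up).

def propagate(target, aux, seeds):
    """Greedily extend a seed mapping {target_node: aux_node} by matching
    nodes whose already-mapped neighbors coincide."""
    mapping = dict(seeds)
    changed = True
    while changed:
        changed = False
        for t in target:
            if t in mapping:
                continue
            # Score each still-unmapped aux node by how many of t's mapped
            # neighbors map onto one of its neighbors.
            scores = {}
            for a in aux:
                if a in mapping.values():
                    continue
                score = sum(1 for n in target[t]
                            if n in mapping and mapping[n] in aux[a])
                if score:
                    scores[a] = score
            if not scores:
                continue
            best = max(scores, key=scores.get)
            # Only accept a clear winner (a crude stand-in for the paper's
            # eccentricity check).
            if list(scores.values()).count(scores[best]) == 1:
                mapping[t] = best
                changed = True
    return mapping

# Anonymized "target" graph (think: a network with identities stripped)...
target = {1: {2, 3}, 2: {1, 3, 4}, 3: {1, 2}, 4: {2}}
# ...and an auxiliary graph where identities are known.
aux = {"alice": {"bob", "carol"}, "bob": {"alice", "carol", "dave"},
       "carol": {"alice", "bob"}, "dave": {"bob"}}

mapping = propagate(target, aux, seeds={1: "alice", 2: "bob"})
print(mapping)  # → {1: 'alice', 2: 'bob', 3: 'carol', 4: 'dave'}
```

Starting from just two seeds, the matching friend structure is enough to re-identify the remaining "anonymous" nodes, which is the gist of why stripping names from a social graph buys so little.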

We wish we could say with some high degree of confidence how these guys managed the trick of de-anonymizing social network members via algorithms, but we confess to some measure of befuddlement over the equations behind the test and the opaqueness of the terminology. Once the basic matches were established, however, the researchers claimed that in 30.8% of the cases their algorithms correctly re-identified previously anonymized members of the social graph. We're still not clear from the paper what the level of identification of individuals was in this experiment.

The implications of this research are twofold, at least. First, the researchers seem to be demonstrating that simple algorithms can be applied by just about any advertiser or developer to readily available anonymous social graphs to render identities. The test claims to have successfully "re-identified" thousands of Twitter users simply by cross-referencing anonymized social graphs with Flickr. In fact, these researchers said their assumptions in building the algorithms were conservative, and that in the real-world overlap of massive networks like Facebook and MySpace it should be possible to re-identify much higher percentages of common users, if only because so many people readily identify themselves on one or both of the networks. As the social networks grow larger and their memberships overlap, the research argues, it will become much easier to thwart anonymity. Even anonymized data contains attributes that can make the process easy.

The researchers feel that the potential for abuse is clear and present. Any potential solution would appear to necessitate a fundamental shift in business models and practices and clearer privacy laws on the subject of Personally Identifiable Information.

Perhaps more to the point, the researchers are arguing that "anonymity" online does not equal privacy. In fact, they seem to be proving that privacy on social networks can be breached fairly easily.

At their clearer online FAQ on the research, the team says it does not expect a technology solution to the problem. "First, the false dichotomy between personally identifiable and non-personally identifiable information should disappear from privacy policies, laws, etc." the team says in its online FAQ. "Any aspect of an individual's online personality can be used for de-anonymization, and this reality should be recognized by the relevant legislation and corporate privacy policies."

Just as the lines between PII and non-PII have themselves become blurred, the associations between anonymity and privacy are growing more complex. As Jules Polonetsky of the Future of Privacy Forum tells us in response to this new research, "It's important to understand that personal and anonymous are not black and white. There are gradients. Any time there are a number of pieces of data about one user, there is an increasing potential they can be identified. For most users it's rather unlikely that anyone will go through the trouble, but it is important to understand that as you build an electronic trail, identification becomes possible."

Hardy readers and cosine freaks can download the original paper or view an online version at

2 comments about "You've Been De-Anonymized".
  1. Warren Lee from WHL Consulting, April 3, 2009 at 1:59 p.m.

    Well that's surely a surprise: Use a bunch of taxpayers' money for a study that seems self-evident at the outset. It is obvious that given enough anonymous information, there are math wizards and algorithms that are able to match a minority, albeit a large one, of identities. The issue for me is the privacy policy that lets the information out in the first place. Now is there any harm done to anyone? Now consider:

    Since there is no such thing as privacy in the offline world, why, pray tell, is the online world held to a different standard? I think this is a major question that needs to be addressed. If I go to a DM firm, offline, I can select female breast cancer survivors, over 50, in the following zip codes, who have an HHI of over $150k, and so on. Online if I were to try that I think that it might start an inquisition! What is the justification for the disparity?

  2. Bruce May from Bizperity, April 3, 2009 at 2:13 p.m.

    Wow... and I thought that no one would ever find out about my secret fascination with Galactic Clusters and radical string theories... oh well, I guess you might as well know everything about me.... The bottom line is simple: if I put my real name on any network then you can find out everything there is to know about me. Since most of us are on Linkedin I guess we have already lost the war. The government can outlaw this kind of data acquisition, but once the data is sold to a third party you have the problem of tracking that transaction, and we know how difficult that is (especially across national boundaries). Since I don't want the government to tell me what information I can and can't share on a social network (that would be unconstitutional), it looks like I am assuming all the risk. Since I don't see any technical way to control it (and neither do the authors of this study), what can the government do anyway? It looks like a perplexing problem just got a lot more perplexing.
