Re-Identification of anonymised data sets

Many people seem to believe that personal data can be anonymised simply by overwriting the identifier with asterisks. This is incorrect, and it exposes both the business or institution and the data subjects to major privacy risks.

Useful Definitions

First, it is worth revisiting two terms: pseudonymisation and anonymisation. The difference between them is key to understanding what a company can or cannot lawfully do with people’s data.

Pseudonymisation is defined by GDPR Article 4 (5) as “the processing of personal data in such a way that the data can no longer be attributed to a specific data subject without the use of additional information, as long as such additional information is kept separately and subject to technical and organisational measures to ensure non-attribution to an identified or identifiable individual”.

Anonymised data is described in Recital 26 as “information which does not relate to an identified or identifiable natural person or to personal data rendered anonymous in such a manner that the data subject is not or no longer identifiable”.

Based on these two definitions, the data referred to in the opening paragraph is clearly pseudonymised, not anonymised. Why does this matter? Anonymised data is not personal data, so its loss is not a personal data breach with the associated penalties, and it is not covered by a Subject Access Request. Pseudonymised data, by contrast, is still personal data and has to be protected, managed and deleted like any other type of personal data. There have been several cases where so-called “anonymised data” has been released by companies into the public domain, only for individuals to be identified using various re-identification algorithms. The Netflix Prize and Singapore transit data sets are just two examples. If the data is actually only pseudonymised then releasing it may be illegal; if it is truly anonymised then releasing it is fine.
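The distinction can be illustrated with a minimal Python sketch. It is not taken from the GDPR or from any particular product; the field names, the sample records and the helper functions pseudonymise and reidentify are all hypothetical. It replaces a direct identifier with a random token and keeps the "additional information" (the key table) separately. Anyone holding the key table can reverse the process, which is why the tokenised records remain personal data.

```python
import secrets

def pseudonymise(records, id_field="name"):
    """Replace the direct identifier with a random token.

    Returns the tokenised records plus a key table (token -> identifier).
    The key table is the "additional information" that must be kept separately.
    """
    key_table = {}      # token -> original identifier
    tokens_by_id = {}   # original identifier -> token, so repeats get the same token
    output = []
    for record in records:
        identifier = record[id_field]
        if identifier not in tokens_by_id:
            token = secrets.token_hex(8)
            tokens_by_id[identifier] = token
            key_table[token] = identifier
        new_record = {k: v for k, v in record.items() if k != id_field}
        new_record["token"] = tokens_by_id[identifier]
        output.append(new_record)
    return output, key_table

def reidentify(pseudonymised, key_table, id_field="name"):
    """Reverse the pseudonymisation using the separately held key table."""
    return [{**r, id_field: key_table[r["token"]]} for r in pseudonymised]

# Hypothetical sample data
records = [
    {"name": "Alice Example", "postcode": "AB1 2CD", "rating": 5},
    {"name": "Bob Example", "postcode": "EF3 4GH", "rating": 2},
    {"name": "Alice Example", "postcode": "AB1 2CD", "rating": 4},
]
pseudo, key_table = pseudonymise(records)
restored = reidentify(pseudo, key_table)
```

Note that even without the key table, the remaining fields (here the postcode) may still allow someone to be identified, which is the subject of the examples that follow.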

Re-Identification of the Netflix Prize Dataset

On October 2, 2006, Netflix, the world’s largest online DVD rental service, announced the $1-million Netflix Prize for improving their movie recommendation service.
The data was anonymised by removing personal details and replacing names with random numbers to protect the privacy of the recommenders. Less than 18 months later, on February 5, 2008, Arvind Narayanan and Vitaly Shmatikov of the University of Texas at Austin released the paper Robust De-anonymization of Large Datasets (How to Break Anonymity of the Netflix Prize Dataset) (Ref 1).
The researchers de-anonymised some of the Netflix data by comparing ratings and timestamps with a second data set taken from the Internet Movie Database (IMDb). Their paper showed that this process isn’t hard and doesn’t require a lot of data. If you strip out the top 100 blockbuster movies that everyone watches, what remains is fairly individual and therefore identifiable.

Using this insight they were easily able to identify a few individuals with a trivial amount of computer assistance; most of the work was done with human insight. As they wrote, the key question is not “Does the average Netflix subscriber care about the privacy of his movie viewing history?” but “Are there any Netflix subscribers whose privacy can be compromised by analysing the Netflix Prize dataset?” The answer, naturally, was yes. One individual (who was not named in the paper) had his political orientation inferred from his strong opinions about “Power and Terror: Noam Chomsky in Our Times” and “Fahrenheit 9/11”. Strong guesses about his religious views could be made from his ratings of “Jesus of Nazareth” and “The Gospel of John”. His ratings of two gay-themed films rounded out a quite detailed profiling exercise.
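The kind of matching described above can be sketched in a few lines of Python. This is a simplified illustration only, loosely inspired by the idea of comparing ratings and dates across two data sets and giving more weight to rarely rated films; it is not the algorithm from the paper. The function names, tolerances, weighting and sample data are all assumptions made for the example.

```python
from datetime import date
from math import log

def match_score(aux, candidate, popularity, rating_tol=1, days_tol=14):
    """Score how well a public profile (aux) matches one released record.

    Both profiles map movie -> (rating, date). Rarely rated movies count
    for more, because agreeing on a blockbuster says little.
    """
    score = 0.0
    for movie, (aux_rating, aux_date) in aux.items():
        if movie not in candidate:
            continue
        cand_rating, cand_date = candidate[movie]
        if (abs(aux_rating - cand_rating) <= rating_tol
                and abs((aux_date - cand_date).days) <= days_tol):
            score += 1.0 / log(1 + popularity[movie])
    return score

def best_match(aux, released, min_gap=0.5):
    """Return the released user id that best matches aux, or None if the
    best candidate does not clearly stand out from the runner-up."""
    popularity = {}
    for profile in released.values():
        for movie in profile:
            popularity[movie] = popularity.get(movie, 0) + 1
    ranked = sorted(
        ((match_score(aux, profile, popularity), user_id)
         for user_id, profile in released.items()),
        reverse=True,
    )
    if not ranked:
        return None
    top_score, top_id = ranked[0]
    runner_up = ranked[1][0] if len(ranked) > 1 else 0.0
    if top_score > 0 and top_score - runner_up >= min_gap:
        return top_id
    return None

# Hypothetical "anonymised" release: random user number -> ratings
released = {
    101: {"Blockbuster": (5, date(2005, 3, 1)),
          "Obscure Documentary": (4, date(2005, 3, 3)),
          "Rare Indie": (2, date(2005, 4, 10))},
    102: {"Blockbuster": (5, date(2005, 3, 2)), "Romcom": (3, date(2005, 5, 1))},
    103: {"Blockbuster": (4, date(2005, 3, 1)), "Romcom": (4, date(2005, 6, 1))},
}
# Hypothetical public profile of a named person on another site
public_profile = {
    "Blockbuster": (5, date(2005, 3, 2)),
    "Obscure Documentary": (4, date(2005, 3, 5)),
    "Rare Indie": (2, date(2005, 4, 12)),
}
matched_user = best_match(public_profile, released)
```

In this toy data the public profile matches record 101 because of the two rarely rated films; a profile containing only the blockbuster matches all three records equally and so yields no confident match.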

The fear is that the same approach would hold true for our book reading habits, our internet shopping habits, our telephone habits and our web searching habits.

Re-Identification of Smart City Datasets

In a completely different experiment, researchers from MIT’s Senseable City Lab (Ref 2) set out to understand how anonymous citizens are in a smart city. They took two anonymised datasets of people in Singapore: one a set of mobile phone logs, the other a set of public transport trips. Both data sets contain “location stamps” detailing the time and place of each data point. They then used an algorithm to match users whose data overlapped closely between the two sets. After the first week they had matched 17% of the users; after 11 weeks they reported a 95% rate of accuracy.
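The source does not describe the researchers' algorithm in detail, so the following Python sketch only illustrates the general idea of matching users by overlapping location stamps. The time bucketing, the overlap threshold, the function names and the sample data are all assumptions made for the example.

```python
def to_stamps(events, bucket_minutes=30):
    """Convert (minutes_since_start, place) events into coarse location stamps."""
    return {(minutes // bucket_minutes, place) for minutes, place in events}

def link_users(phone_logs, transit_trips, min_overlap=2):
    """For each phone user, find the transit user sharing the most stamps.

    A link is only recorded when at least min_overlap stamps coincide.
    """
    transit_stamps = {uid: to_stamps(ev) for uid, ev in transit_trips.items()}
    links = {}
    for phone_id, events in phone_logs.items():
        stamps = to_stamps(events)
        overlaps = {tid: len(stamps & ts) for tid, ts in transit_stamps.items()}
        if not overlaps:
            continue
        best_id = max(overlaps, key=overlaps.get)
        if overlaps[best_id] >= min_overlap:
            links[phone_id] = best_id
    return links

# Hypothetical data: each data set has its own unrelated user ids
phone_logs = {
    "p1": [(480, "Station A"), (495, "Station B"), (1080, "Station B"), (1100, "Station A")],
    "p2": [(600, "Station C"), (620, "Station D")],
}
transit_trips = {
    "t9": [(485, "Station A"), (505, "Station B"), (1085, "Station B")],
    "t4": [(605, "Station C"), (625, "Station D")],
}
links = link_users(phone_logs, transit_trips)
```

The more weeks of data each user accumulates, the more stamps there are to compare, which is consistent with the match rate in the study rising over time.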

For smart cities the implication is that anonymising data set by set may not offer sufficient protection. In other words, as urban planners, tech companies, and governments collect and share data, claims that each individual data set is “anonymised” are no guarantee of privacy. If the individual data sets can be linked, that anonymisation can be breached. Much of the investment in smart cities is driven by businesses training algorithms. The data sets citizens generate are of great value to those businesses, which is why they are used, and this usage must be regulated. We may yet need a contract at city level under which data about us is released only by one centralised office.

Routes to Re-Identification

Microsoft’s Cynthia Dwork (Ref 3) warns regulators and businesses not to “look at systems to protect privacy in isolation.” She is very clear that two data sets could each be totally innocuous on their own, but that a risk can arise when they are brought together. The risk of re-identification from combining multiple data sets is real and exists today.

The Article 29 Working Party highlights (Ref 4) three risks to consider when evaluating anonymisation techniques:
(i) Is it still possible to single out an individual? This would mean it is possible to isolate some or all records which identify an individual in the dataset;
(ii) Is it still possible to link records relating to an individual? That is, can two records concerning the same data subject or group of data subjects (either in the same database or in two different databases) be linked? If an attacker can establish (e.g. by means of correlation analysis) that two records are assigned to the same group of individuals but cannot single out individuals in this group, the technique provides resistance against “singling out” but not against linkability;
(iii) Can information be inferred concerning an individual? This is the possibility to deduce, with significant probability, the value of an attribute from the values of a set of other attributes.
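Two of these questions, singling out and inference, can be checked mechanically on a small table. The Python sketch below is an illustration under assumed column names and sample data, not a method prescribed by the Working Party. It groups records by their quasi-identifiers (attributes that are not names but can narrow down who someone is), then reports groups of size one and groups whose members all share the same sensitive value.

```python
from collections import defaultdict

def group_by_quasi_identifiers(records, quasi_identifiers):
    """Group records that share the same quasi-identifier values."""
    groups = defaultdict(list)
    for record in records:
        key = tuple(record[q] for q in quasi_identifiers)
        groups[key].append(record)
    return groups

def singled_out(records, quasi_identifiers):
    """Records whose quasi-identifier combination is unique in the data set."""
    groups = group_by_quasi_identifiers(records, quasi_identifiers)
    return [rs[0] for rs in groups.values() if len(rs) == 1]

def inferable(records, quasi_identifiers, sensitive):
    """Groups of two or more records that all share one sensitive value.

    Nobody in such a group is singled out, yet the sensitive value can be
    deduced for every member.
    """
    groups = group_by_quasi_identifiers(records, quasi_identifiers)
    return {
        key: rs[0][sensitive]
        for key, rs in groups.items()
        if len(rs) > 1 and len({r[sensitive] for r in rs}) == 1
    }

# Hypothetical data set with names already removed
records = [
    {"age_band": "30-39", "postcode_area": "AB1", "diagnosis": "asthma"},
    {"age_band": "30-39", "postcode_area": "AB1", "diagnosis": "asthma"},
    {"age_band": "40-49", "postcode_area": "AB1", "diagnosis": "diabetes"},
    {"age_band": "50-59", "postcode_area": "EF3", "diagnosis": "asthma"},
    {"age_band": "50-59", "postcode_area": "EF3", "diagnosis": "diabetes"},
]
quasi = ["age_band", "postcode_area"]
unique_records = singled_out(records, quasi)
leaky_groups = inferable(records, quasi, "diagnosis")
```

Here the single person aged 40-49 in AB1 can be singled out, and anyone known to be aged 30-39 in AB1 can be inferred to have asthma even though two records share those values.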

To learn more about the maths behind re-identification, read this document by the “Privacy Tools for Sharing Research Data” project at Harvard University.


Without a formal analysis of the risk of re-identification, within the wider context of other available data sets, assurances of data anonymity may not be accurate. Data anonymisation can’t be assessed in terms of a single tool or method, as anonymisation takes place within a context. It is only by evaluating the context in the round that a risk analysis can produce a credible and relevant risk model. With a credible risk model, owners of a data set can protect privacy and exploit the data set while explicitly managing the risk of a privacy breach.

If you have any questions about this or any other data protection query, please get in touch.

[Ref 1]

[Ref 2]

[Ref 3]

[Ref 4]
