Personal data are often shared anonymously. But we know that it’s possible to reidentify people. ICTEAM researchers have just developed a model that accurately estimates the probability of a person being correctly identified.
The use of the Internet and mobile applications generates billions of data points. Demographic, social, medical, or economic, these data have become a rich resource for researchers and companies alike, at the risk of violating everyone’s privacy. Of course, in Europe, the General Data Protection Regulation (GDPR) is supposed to protect citizens. But as Luc Rocher, an FNRS research fellow and PhD student at UCLouvain’s ICTEAM institute, explains, ‘Anonymised data obviously don’t require the consent of the person concerned. The GDPR doesn’t apply to this type of data.’ Such data are therefore widely shared around the world.
When an organisation holds personal data that it wants to share (most often by selling them to a client), it practises what is called data anonymisation: first de-identification (names are removed and some features may be modified), then sampling. The latter technique consists of delivering only some, say 10%, of the data to the client. Sampling has even served as a major counter-argument against claims of reidentification, because we have known for some years now that such precautions do not prevent journalists and researchers from identifying the person behind supposedly anonymous data. Or at least that’s what such ‘crackers’ of anonymity were more or less persuaded of, without ever being completely sure. Suppliers had a strong case to argue: ‘You claim to have recognised such-and-such a person, but you only have 10% of the data, so what tells you that in the remaining 90% – and more generally in the rest of the Belgian population – there’s no other person with the same characteristics?’ Irrefutable, indeed. Until today. Enter Mr Rocher and his colleagues Julien Hendrickx, an ICTEAM professor, and Yves-Alexandre de Montjoye, head of the Computational Privacy Group at Imperial College London.
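The suppliers’ argument can be illustrated with a small simulation. This is a toy sketch, not the study’s data or method: it builds an invented population of 10,000 people with three coarse attributes, releases a 10% sample, and checks how many records that are unique in the sample are also unique in the full population.

```python
import random
from collections import Counter

random.seed(42)

# Invented toy population: each person is a tuple of coarse attributes.
# The attributes and sizes are illustrative, not taken from the study.
population = [
    (random.randint(18, 90),                                   # age
     random.choice(["M", "F"]),                                # sex
     random.choice(["Brussels", "Liege", "Ghent", "Namur"]))   # city
    for _ in range(10_000)
]

# Release only a 10% sample, as a data supplier might.
sample = random.sample(population, 1_000)

pop_counts = Counter(population)
sample_counts = Counter(sample)

# Records a 'cracker' would flag: unique within the released sample.
unique_in_sample = [r for r in sample if sample_counts[r] == 1]
# How many of those are truly unique in the whole population?
also_unique_in_population = [r for r in unique_in_sample if pop_counts[r] == 1]

share = len(also_unique_in_population) / len(unique_in_sample)
print(f"{len(unique_in_sample)} records unique in the sample, "
      f"{share:.0%} of them unique in the full population")
```

With only three coarse attributes, most records that look unique in the 10% sample have many twins in the full population, which is exactly the suppliers’ objection.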
‘What we’ve developed,’ Mr Rocher explains, ‘is a model that allows for estimating whether a person’s reidentification is correct.’ To achieve this, the researchers developed an algorithm, fed with a small collection of data from a population of a few thousand people, that learns little by little which features are more unique, more distinctive than others. For example, being 100 years old is obviously a more distinctive feature than being 30. ‘Then,’ Mr Rocher continues, ‘we looked at correlations that may exist between these features. Thus “student + 20 + Louvain-la-Neuve” is hardly unique, while “student + 60 + Louvain-la-Neuve” is much more so, because there are few 60-year-old students on campus. Our algorithm therefore combines the correlation and distribution of features to build a model of the population that allows for determining the probability that there exist, for example, in the Belgian population, two men aged 30, born on 5 January 1989, living in Schaerbeek, with two daughters, a dog, and a red car. Presumably, there’s almost no chance of this. So if you have identified a person who has these features, you can be sure he’s the right person; there is no other.’
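The last step of that reasoning, computing the chance that no second person in the population matches a given combination of features, can be sketched in a few lines. The frequencies below are invented for illustration, and the features are treated as independent for simplicity, whereas the researchers’ model also captures the correlations Mr Rocher describes.

```python
# A deliberately simplified sketch, NOT the published model: it estimates
# how likely it is that no other person in a population of size N shares a
# given combination of features, assuming independent features and using
# invented marginal frequencies.

N = 11_500_000  # roughly the Belgian population

# Invented frequencies for each feature of the example record.
feature_freqs = {
    "man": 0.49,
    "aged 30": 0.013,
    "born on 5 January": 1 / 365,
    "lives in Schaerbeek": 0.011,
    "two daughters": 0.08,
    "owns a dog": 0.2,
    "red car": 0.05,
}

# Probability that a random person matches the whole combination
# (under the naive independence assumption).
p_match = 1.0
for p in feature_freqs.values():
    p_match *= p

# Probability that none of the other N - 1 people match.
p_unique = (1 - p_match) ** (N - 1)
print(f"p(match) = {p_match:.3g}, p(no one else matches) = {p_unique:.3f}")
```

Even with these made-up numbers, the combination is so rare that a second matching person in Belgium is very unlikely, which is the intuition behind the model’s certainty score.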
Certain? ‘We have been able to show that in the US – we’ve worked mainly with American data – 15 demographic attributes (age, sex, etc.) are sufficient to make reidentification possible in 99.98% of cases! Each time, our model indicated that the probability of another person existing with the same features as the reidentified one was almost zero. And you should know that some anonymous data shared online contain several hundred features per person!’ The UCLouvain researchers are thus the first to have developed a generic model that makes it possible to certify an identification in any database.
Other anonymisation techniques
The researchers’ goal is not to have data sharing prohibited. Scientific research and the digital economy need it. ‘Instead, we advocate keeping data in secure environments, using privacy engineering methods. Researchers or companies could access them remotely and pose their questions, but the processing would take place at the database issuer, with the user receiving only aggregated results. Such procedures already exist, for example for financial or medical data, but they should become the norm.’ Luc Rocher’s article on these questions is today’s top story on The Conversation.
A glance at Luc Rocher's bio
Luc Rocher has always liked maths. He studied mathematics and computer science at the Ecole Normale Supérieure in Lyon and developed a taste for research, notably during stays at the Massachusetts Institute of Technology (MIT). In 2015, he became an FNRS research fellow at ICTEAM and began a PhD thesis on the limits of personal data anonymisation and on better techniques for sharing such data, under the guidance of Prof. Julien Hendrickx (UCLouvain) and Yves-Alexandre de Montjoye, head of the Computational Privacy Group at Imperial College London.