Anonymization: what it really is and how to achieve it

Since the General Data Protection Regulation (GDPR) recommended anonymization as a data security measure, the concept has become a hot topic of discussion among organizations and business owners. Recital 26 of the regulation defines anonymous information as “…information which does not relate to an identified or identifiable natural person or to personal data rendered anonymous in such a manner that the data subject is not or no longer identifiable”, and points out that the principles of data protection do not apply to anonymized information.

Data anonymization is a data processing technique that removes or modifies sensitive information about people. The idea is to alter the data in such a way that the identity, or certain characteristics, of the people whose data are being processed cannot be revealed. But how does the procedure work?

To anonymize the data correctly, the first step is to analyse the structure of the dataset. The aim is to classify the variables or attributes into three categories:

Identifiers: variables that uniquely identify record owners, such as the DNI (the Spanish national identity number), mobile phone number, social security number, etc.
Quasi-identifiers: variables that reveal nothing on their own but may partially reveal the owner's identity when linked with external data sharing the same quasi-identifiers, e.g., gender, state of residence, etc.
Sensitive attributes: variables that could cause societal or personal harm if linked with an identifier, or variables about which we want to draw conclusions. Examples include medical diagnoses such as HIV or COVID-19, salaries, etc.
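
As a minimal sketch of this first step, the classification can be recorded as a simple mapping from column names to categories. The column names and the helper `columns_with_role` below are hypothetical and only serve as illustration; in practice, the assignment is a manual decision that depends on the dataset and its context.

```python
from enum import Enum


class Role(Enum):
    IDENTIFIER = "identifier"
    QUASI_IDENTIFIER = "quasi-identifier"
    SENSITIVE = "sensitive"


# Hypothetical columns of an illustrative dataset, classified by hand.
COLUMN_ROLES = {
    "national_id": Role.IDENTIFIER,
    "mobile_phone": Role.IDENTIFIER,
    "gender": Role.QUASI_IDENTIFIER,
    "state": Role.QUASI_IDENTIFIER,
    "diagnosis": Role.SENSITIVE,
    "salary": Role.SENSITIVE,
}


def columns_with_role(roles, role):
    """Return the column names assigned to the given role, in insertion order."""
    return [name for name, assigned in roles.items() if assigned is role]
```

For example, `columns_with_role(COLUMN_ROLES, Role.QUASI_IDENTIFIER)` returns the columns that the techniques described next would need to generalize or randomize.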

Once the structure of the data is known, the second step is to apply an anonymization technique to the dataset. Available options include:

Generalization: replaces the original value with a less specific but semantically consistent one. For example, a variable such as age could be grouped into intervals (0-10, 10-20, 20-30, etc.) and the mean value of the interval used instead of the original value.
Randomization: adds a certain amount of randomness to numeric data values, for example by generating values from a Gaussian distribution with zero mean and adding them to the original values of the variable.
Deletion: completely eliminates the variable or attribute.
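
The three options above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are hypothetical, records are assumed to be plain dictionaries, and age intervals are treated as half-open ranges of the form [low, low + width), so that an age of 40 falls in the 40-50 group.

```python
import random


def generalize_age(age, width=10):
    """Generalization: replace an age with the midpoint of its interval."""
    low = (age // width) * width  # lower bound of [low, low + width)
    return low + width / 2


def add_gaussian_noise(values, sigma, seed=None):
    """Randomization: add zero-mean Gaussian noise with standard deviation sigma."""
    rng = random.Random(seed)
    return [value + rng.gauss(0.0, sigma) for value in values]


def delete_columns(records, columns):
    """Deletion: return copies of the records without the given columns."""
    drop = set(columns)
    return [{k: v for k, v in rec.items() if k not in drop} for rec in records]
```

A typical combination would delete the identifiers, generalize the quasi-identifiers and, where appropriate, add noise to numeric attributes. The choice of interval width and of `sigma` is a trade-off between privacy and the usefulness of the resulting data.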

For generalization, k-anonymity, l-diversity and t-closeness are well-known privacy models. Under k-anonymity, records are generalized so that every combination of quasi-identifier values is shared by a group of at least k individuals, which makes each record indistinguishable from at least k-1 others. l-diversity extends this idea by requiring each such group to contain at least l different values of the sensitive attribute. t-closeness goes one step further and takes the original data distribution into account: the distribution of the sensitive attribute within each group must lie within a threshold t of its distribution in the whole dataset.

As for randomization, examples include noise addition, permutation and differential privacy, which is currently very much in vogue and is applied by large companies such as Apple, Amazon and Facebook. Noise addition is considered the simplest approach: it adds statistical noise to the values of a dataset while aiming to preserve their overall distribution. Permutation techniques shuffle the values of a sensitive attribute among records, so that each value ends up linked to a different individual. Differential privacy introduces calibrated statistical noise as a means of ensuring that the risk incurred by participating in a dataset is only marginally greater than the risk of not participating in it.
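
To make these models more concrete, the following sketch checks k-anonymity and the simplest ("distinct") form of l-diversity on a list of records, and releases a count with Laplace noise, a standard building block of differential privacy. All function names are hypothetical, records are assumed to be dictionaries, and the count query is assumed to have sensitivity 1 (adding or removing one person changes it by at most 1). It is an illustration, not a production-grade implementation.

```python
import random
from collections import defaultdict


def equivalence_classes(records, quasi_identifiers):
    """Group records that share the same values on all quasi-identifiers."""
    groups = defaultdict(list)
    for rec in records:
        key = tuple(rec[q] for q in quasi_identifiers)
        groups[key].append(rec)
    return groups


def is_k_anonymous(records, quasi_identifiers, k):
    """True if every equivalence class contains at least k records."""
    groups = equivalence_classes(records, quasi_identifiers)
    return all(len(group) >= k for group in groups.values())


def is_l_diverse(records, quasi_identifiers, sensitive, l):
    """True if every equivalence class has at least l distinct sensitive values."""
    groups = equivalence_classes(records, quasi_identifiers)
    return all(
        len({rec[sensitive] for rec in group}) >= l for group in groups.values()
    )


def laplace_count(true_count, epsilon, rng=random):
    """Release a count with Laplace noise of scale 1/epsilon (sensitivity 1)."""
    # The difference of two independent exponential variables with rate
    # epsilon follows a Laplace distribution with mean 0 and scale 1/epsilon.
    noise = rng.expovariate(epsilon) - rng.expovariate(epsilon)
    return true_count + noise
```

A smaller `epsilon` means more noise and therefore stronger privacy at the cost of accuracy. Similarly, larger values of k and l give stronger guarantees but usually require coarser generalization of the quasi-identifiers.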

Currently, Gradiant is participating in the H2020 project INFINITECH, researching and developing a tool for anonymizing personal data that automatically determines the best anonymization strategy and incorporates new advanced techniques.

Author: Marta Sestelo, Technical Manager of Data Analytics & AI at Gradiant

______

P. Samarati and L. Sweeney, “Protecting privacy when disclosing information: k-anonymity and its enforcement through generalization and suppression,” 1998, [Online]. Available: http://epic.org/privacy/reidentification/Samarati_Sweeney_paper.pdf.
A. Machanavajjhala, D. Kifer, J. Gehrke, and M. Venkitasubramaniam, “l-diversity: Privacy beyond k-anonymity,” ACM Trans. Knowl. Discov. Data, vol. 1, no. 1, p. 3–es, 2007.
N. Li, T. Li, and S. Venkatasubramanian, “t-Closeness: Privacy Beyond k-Anonymity and l-Diversity,” in 2007 IEEE 23rd International Conference on Data Engineering, Apr. 2007, pp. 106–115.
K. Mivule, “Utilizing Noise Addition for Data Privacy, an Overview,” arXiv [cs.CR], Sep. 16, 2013.
Q. Zhang, N. Koudas, D. Srivastava, and T. Yu, “Aggregate Query Answering on Anonymized Tables,” in 2007 IEEE 23rd International Conference on Data Engineering, Apr. 2007, pp. 116–125.
C. Dwork and A. Roth, “The algorithmic foundations of differential privacy,” Found. Trends Theor. Comput. Sci., vol. 9, no. 3–4, pp. 211–407, 2014.