An Open Source Python Library for Anonymizing Sensitive Data

Judith Sáinz-Pardo Díaz, Álvaro López García
2024-08-20
Abstract: Open science is a fundamental pillar to promote scientific progress and collaboration, based on the principles of open data, open source and open access. However, the requirements for publishing and sharing open data are in many cases difficult to meet in compliance with strict data protection regulations. Consequently, researchers need to rely on proven methods that allow them to anonymize their data without sharing it with third parties. To this end, this paper presents the implementation of a Python library for the anonymization of sensitive tabular data. This framework provides users with a wide range of anonymization methods that can be applied on the given dataset, including the set of identifiers, quasi-identifiers, generalization hierarchies and allowed level of suppression, along with the sensitive attribute and the level of anonymity required. The library has been implemented following best practices for integration and continuous development, as well as the use of workflows to test code coverage based on unit and functional tests.
Cryptography and Security, Databases, Software Engineering
### What problem does this paper attempt to solve?

This paper addresses how to anonymize sensitive data in compliance with strict data protection regulations in the context of open science. When researchers share and publish data, they must ensure that the data do not disclose personal information and that they meet the requirements of regulations such as the General Data Protection Regulation (GDPR) in the European Union and the Privacy Act in the United States. In addition, with the growth of artificial intelligence (AI), machine learning (ML) and deep learning (DL), a further research focus is ensuring that the data used to train models are properly pre-processed and anonymized, so that potential biases do not spread into the models.

To this end, the paper introduces an open-source Python library named **anjana** for anonymizing sensitive tabular data. The library provides a variety of anonymization methods, and users can select the technique that suits their dataset, so that the data remain usable for scientific research and analysis without exposing personally identifiable information.

### Main Anonymization Techniques

The anonymization techniques mentioned in the paper include, but are not limited to, the following:

1. **k-anonymity**: Each equivalence class (the set of records that share the same quasi-identifier values) contains at least \( k \) records.
2. **(α, k)-anonymity**: In addition to k-anonymity, the relative frequency of each sensitive attribute value within every equivalence class must not exceed α.
3. **ℓ-diversity**: The sensitive attribute takes at least ℓ different values in each equivalence class.
4. **Entropy ℓ-diversity**: A stricter diversity measure that requires the entropy of the sensitive attribute in each equivalence class to be at least \( \log(\ell) \).
5. **Recursive (c, ℓ)-diversity**: Requires that the most frequent sensitive value in each equivalence class does not appear too often relative to the less frequent ones, which guards against skewed classes and inference attacks.
6. **t-closeness**: The distribution of the sensitive attribute in each equivalence class is within distance \( t \) of its distribution in the whole dataset.
7. **δ-disclosure privacy**: Limits the maximum relative (logarithmic) distance between the frequency of each sensitive value in an equivalence class and its frequency in the whole dataset.
8. **Basic β-likeness** and **enhanced β-likeness**: More robust privacy models that bound the relative gain in an attacker's confidence about each sensitive value, compared with its frequency in the whole dataset.

With these techniques, the anjana library helps researchers protect personal privacy when processing sensitive data while preserving the availability and research value of the data.

### Summary

The core problem of this paper is how to build an easy-to-use and powerful tool that helps researchers anonymize sensitive data in compliance with data protection regulations, thereby promoting the development of open science.
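
To make the first few definitions above concrete, the following is a minimal sketch in plain Python that measures the k-anonymity, distinct ℓ-diversity and entropy ℓ-diversity levels a table already satisfies. It does not use the anjana API; the function names (`equivalence_classes`, `achieved_k`, `achieved_l`, `achieved_entropy_l`) and the toy records are hypothetical, and the table is assumed to have been generalized beforehand.

```python
from collections import Counter, defaultdict
from math import exp, log


def equivalence_classes(records, quasi_identifiers):
    """Group records that share the same quasi-identifier values."""
    classes = defaultdict(list)
    for record in records:
        key = tuple(record[q] for q in quasi_identifiers)
        classes[key].append(record)
    return classes


def achieved_k(records, quasi_identifiers):
    """Largest k for which the table is k-anonymous (smallest class size)."""
    classes = equivalence_classes(records, quasi_identifiers)
    return min(len(rows) for rows in classes.values())


def achieved_l(records, quasi_identifiers, sensitive):
    """Largest l for distinct l-diversity (fewest distinct values in a class)."""
    classes = equivalence_classes(records, quasi_identifiers)
    return min(len({r[sensitive] for r in rows}) for rows in classes.values())


def achieved_entropy_l(records, quasi_identifiers, sensitive):
    """Largest l for entropy l-diversity: exp of the smallest class entropy."""
    classes = equivalence_classes(records, quasi_identifiers)
    entropies = []
    for rows in classes.values():
        counts = Counter(r[sensitive] for r in rows)
        n = len(rows)
        entropies.append(-sum(c / n * log(c / n) for c in counts.values()))
    return exp(min(entropies))


# Hypothetical, already generalized toy table (not taken from the paper).
QUASI_IDENTIFIERS = ["age", "zip"]
RECORDS = [
    {"age": "[30, 40)", "zip": "390**", "diagnosis": "flu"},
    {"age": "[30, 40)", "zip": "390**", "diagnosis": "asthma"},
    {"age": "[30, 40)", "zip": "390**", "diagnosis": "flu"},
    {"age": "[40, 50)", "zip": "391**", "diagnosis": "diabetes"},
    {"age": "[40, 50)", "zip": "391**", "diagnosis": "flu"},
]

if __name__ == "__main__":
    print("k =", achieved_k(RECORDS, QUASI_IDENTIFIERS))
    print("l =", achieved_l(RECORDS, QUASI_IDENTIFIERS, "diagnosis"))
    print("entropy l =", achieved_entropy_l(RECORDS, QUASI_IDENTIFIERS, "diagnosis"))
```

A check like this only measures what a table already satisfies. Reaching a target level requires generalizing quasi-identifiers along their hierarchies and suppressing records, which is the part the library automates.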