An Open Source Python Library for Anonymizing Sensitive Data

Judith Sáinz-Pardo Díaz, Álvaro López García

2024-08-20

Abstract: Open science is a fundamental pillar to promote scientific progress and collaboration, based on the principles of open data, open source and open access. However, the requirements for publishing and sharing open data are in many cases difficult to meet in compliance with strict data protection regulations. Consequently, researchers need to rely on proven methods that allow them to anonymize their data without sharing it with third parties. To this end, this paper presents the implementation of a Python library for the anonymization of sensitive tabular data. This framework provides users with a wide range of anonymization methods that can be applied on the given dataset, including the set of identifiers, quasi-identifiers, generalization hierarchies and allowed level of suppression, along with the sensitive attribute and the level of anonymity required. The library has been implemented following best practices for integration and continuous development, as well as the use of workflows to test code coverage based on unit and functional tests.

Cryptography and Security, Databases, Software Engineering

What problem does this paper attempt to address?

This paper addresses the problem of anonymizing sensitive data in the context of open science while complying with strict data protection regulations. When researchers share and publish data, they must ensure that the data do not disclose personal information and that they meet the requirements of regulations such as the General Data Protection Regulation (GDPR) in the European Union and the Privacy Act in the United States. In addition, with the development of artificial intelligence (AI), machine learning (ML) and deep learning (DL), a further concern is that the data used to train models are properly pre-processed and anonymized, so that potential biases do not propagate into the models.

To this end, the paper introduces an open-source Python library named **anjana** for anonymizing sensitive tabular data. The library provides a variety of anonymization methods, and users can select the techniques that suit the characteristics of their dataset, so that the data remain usable for scientific research and data analysis without exposing personally identifying information.

### Main Anonymization Techniques

The anonymization techniques mentioned in the paper include, but are not limited to, the following:

1. **k-anonymity**: Every equivalence class (the group of records that share the same quasi-identifier values) contains at least \( k \) records.
2. **(α, k)-anonymity**: Requires k-anonymity and, in addition, that the relative frequency of each sensitive value within every equivalence class is at most α.
3. **ℓ-diversity**: Every equivalence class contains at least ℓ distinct values of the sensitive attribute.
4. **Entropy ℓ-diversity**: A stricter diversity measure that requires the entropy of the sensitive attribute in every equivalence class to be at least log(ℓ).
5. **Recursive (c, ℓ)-diversity**: Requires that the most frequent sensitive value in each equivalence class does not dominate the less frequent ones, which limits skewed distributions and inference attacks.
6. **t-closeness**: The distribution of the sensitive attribute in each equivalence class is within distance t of its distribution in the whole dataset.
7. **δ-disclosure privacy**: Limits the maximum relative distance between the distribution of the sensitive attribute in each equivalence class and its distribution in the whole dataset.
8. **Basic β-likeness** and **enhanced β-likeness**: Bound the relative increase in the frequency of each sensitive value within an equivalence class compared with the whole dataset, giving more robust privacy models.

With these techniques, the anjana library can help researchers protect personal privacy when processing sensitive data while preserving the availability and research value of the data.

### Summary

The core problem of this paper is how to provide an easy-to-use and powerful tool that helps researchers anonymize sensitive data in compliance with data protection regulations, thereby promoting the development of open science.
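To make the first privacy models listed above concrete, here is a minimal sketch in plain Python that measures which k, ℓ and α an already generalized table satisfies. It does not use the anjana API; the function names (`equivalence_classes`, `measured_k`, `measured_l`, `measured_alpha`) and the toy records are assumptions introduced purely for illustration.

```python
from collections import Counter, defaultdict


def equivalence_classes(rows, quasi_identifiers):
    """Group records that share the same quasi-identifier values."""
    classes = defaultdict(list)
    for row in rows:
        key = tuple(row[q] for q in quasi_identifiers)
        classes[key].append(row)
    return classes


def measured_k(rows, quasi_identifiers):
    """Largest k for which the table is k-anonymous (smallest class size)."""
    classes = equivalence_classes(rows, quasi_identifiers)
    return min(len(members) for members in classes.values())


def measured_l(rows, quasi_identifiers, sensitive):
    """Largest l for distinct l-diversity (fewest distinct sensitive values in any class)."""
    classes = equivalence_classes(rows, quasi_identifiers)
    return min(len({r[sensitive] for r in members}) for members in classes.values())


def measured_alpha(rows, quasi_identifiers, sensitive):
    """Smallest alpha for (alpha, k)-anonymity: the highest relative
    frequency of any single sensitive value inside any class."""
    classes = equivalence_classes(rows, quasi_identifiers)
    return max(
        max(Counter(r[sensitive] for r in members).values()) / len(members)
        for members in classes.values()
    )


# Hypothetical toy table whose quasi-identifiers are already generalized.
rows = [
    {"age": "[20, 30)", "zip": "390**", "disease": "flu"},
    {"age": "[20, 30)", "zip": "390**", "disease": "asthma"},
    {"age": "[20, 30)", "zip": "390**", "disease": "flu"},
    {"age": "[30, 40)", "zip": "391**", "disease": "diabetes"},
    {"age": "[30, 40)", "zip": "391**", "disease": "flu"},
]
QUASI_IDENTIFIERS = ["age", "zip"]

print(measured_k(rows, QUASI_IDENTIFIERS))                           # 2
print(measured_l(rows, QUASI_IDENTIFIERS, "disease"))                # 2
print(round(measured_alpha(rows, QUASI_IDENTIFIERS, "disease"), 2))  # 0.67
```

The sketch only checks a table against the definitions. An anonymization tool additionally has to search for generalizations and suppressions that reach the requested level, which is the harder part and is not attempted here.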
