Abstract: In the era of big data, personal data has become an important resource in scientific research, business analysis, medical services, social computing, and many other fields. Sharing and applying personal data can produce great economic and social value, but improper use can easily disclose private personal information. Resolving the tension between data application and personal privacy has therefore become a current research hotspot.

When personal data is shared and used, explicit identifier attributes such as individuals' names must be deleted in advance. Even so, an attacker can still reveal an individual's identity or sensitive information through one or more non-sensitive values of the quasi-identifier (QI) attributes, such as gender, age, and region, or through some values of the sensitive attributes (SA), such as salary and disease, in the data set. Most current data privacy studies assume a simple one-to-one relationship between individuals and records, which is called single-record data. To protect personal privacy in single-record data, scholars have proposed a variety of typical anonymity models, such as k-anonymity, l-diversity, (α, k)-anonymity, t-closeness, and β-likelihood. In practice, however, there are many data sets in which one individual may correspond to multiple records, called multi-record data for short. Applying the above privacy models directly to multi-record data may introduce new privacy risks. To protect the privacy of individuals in multi-record data, several scholars have proposed identity-reserved (IR) privacy models such as IR k-anonymity, IR l-diversity, and IR (α, β)-anonymity, as well as enhanced models such as EIR (α, β)-diversity and EIR l-diversity, under the assumption that the attacker's background knowledge relates only to QI information. A few scholars have developed (k, k^m)-anonymity and (k, l)-diversity models, supposing that the attacker may know background knowledge of either QI information or SA information. However, none of these models provides adequate protection for the privacy of individuals in multi-record data.

This research analyzes the privacy disclosure problem in multi-record data when an attacker has stronger background knowledge, and proposes a new privacy-preserving model and a corresponding algorithm to satisfy the stricter privacy needs of multi-record data applications. In the first part, it discusses the privacy risks that arise when an attacker knows background knowledge related to either or both of the QI and SA information, and points out the defects of current privacy models. It also presents a new privacy disclosure problem named the unclosed itemset fingerprint attack (UCIFA), which is based on attacks that use strong background knowledge. In the second part, to overcome the UCIFA problem, it requires that each person's full set of sensitive values, expressed as an itemset, be closed. If an individual's SA itemset cannot satisfy the closure constraint by partitioning the records of individuals into several groups, the itemset is further processed by means of cracking. On this basis, a new privacy model named closure and enhanced identity-reserved l-diversity (CEIR l-diversity) is presented, which requires that the QI values and the SA values of each individual satisfy EIR l-diversity and the closure constraint, respectively. In the third part, it develops an algorithm called data anonymization based on closure and enhanced l-diversity (DACEL) to make multi-record data satisfy CEIR l-diversity. The algorithm consists of three core steps: first, dividing the records in a multi-record data set into several QI-groups, so that the records of individuals in each group have similar QI values and satisfy the EIR l-diversity constraint; second, in each QI-group, cracking the sensitive itemset of each individual that contains non-closed subsets into several smaller itemsets, each of which must satisfy the closure constraint; finally, in each QI-group, generalizing the QI values of all records so that the anonymized data table satisfies CEIR l-diversity. In the fourth part, the proposed privacy model and its corresponding algorithm, referred to as the CEL-method, are compared with two leading-edge methods on two public multi-record data sets. The results show that the CEL-method performs robustly and efficiently achieves the highest level of privacy protection for multi-record data at the cost of a small information loss.

In summary, in practical applications of personal data, attackers may have different levels of background knowledge with which to disclose private personal information. The privacy-preserving method proposed in this research is of general significance for protecting multi-record data privacy in practice.
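
The risk of applying single-record models directly to multi-record data can be illustrated with a minimal Python sketch. The table, the field names, and the two checking functions below are hypothetical and are not taken from the paper. Counting distinct individuals per group is only a simplified reading of the identity-reserved idea, not the formal definition of IR k-anonymity or of CEIR l-diversity.

```python
from collections import defaultdict

# Hypothetical multi-record table: (person_id, qi_group, sensitive_value).
# Person p1 owns all three records in group G1.
records = [
    ("p1", "G1", "flu"),
    ("p1", "G1", "asthma"),
    ("p1", "G1", "diabetes"),
    ("p2", "G2", "flu"),
    ("p3", "G2", "asthma"),
    ("p4", "G2", "diabetes"),
]


def group_counts(rows):
    """Return {group: (record_count, distinct_individual_count)}."""
    n_records = defaultdict(int)
    individuals = defaultdict(set)
    for person_id, group, _sensitive in rows:
        n_records[group] += 1
        individuals[group].add(person_id)
    return {g: (n_records[g], len(individuals[g])) for g in n_records}


def record_level_k_anonymous(rows, k):
    """Single-record style check: every group holds at least k records."""
    return all(n >= k for n, _ in group_counts(rows).values())


def individual_level_k_anonymous(rows, k):
    """Assumed multi-record style check: every group holds at least
    k distinct individuals."""
    return all(m >= k for _, m in group_counts(rows).values())


counts = group_counts(records)
print(counts)
print(record_level_k_anonymous(records, 3))
print(individual_level_k_anonymous(records, 3))
```

In this toy table, group G1 passes the record-level check with k = 3 even though all three records belong to one person, so every sensitive value in the group can be attributed to that person. Counting individuals instead of records exposes the problem.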

A divide-and-conquer approach to privacy-preserving high-dimensional big data release

A MapReduce Based Approach of Scalable Multidimensional Anonymization for Big Data Privacy Preservation on Cloud

Privacy Preserving Distributed DBSCAN Clustering

Effective Privacy Preserved Clustering Based on Voronoi Diagram

Proximity-Aware Local-Recoding Anonymization with MapReduce for Scalable Big Data Privacy Preservation in Cloud

Scalable Iterative Implementation of Mondrian for Big Data Multidimensional Anonymisation

Dissemination of Anonymized Streaming Data

Parallel Fuzzy C-Means Clustering Based Big Data Anonymization Using Hadoop MapReduce

UPA: an Automated, Accurate and Efficient Differentially Private Big-Data Mining System

A Dynamic Anonymization Privacy-Preserving Model Based on Hierarchical Sequential Three-Way Decisions

Combining Top-Down and Bottom-Up: Scalable Sub-tree Anonymization over Big Data Using MapReduce on Cloud

Preserving Privacy of High-Dimensional Data by l-Diverse Constrained Slicing

Privacy-Preserving Machine Learning Algorithms for Big Data Systems

SaC-FRAPP: a scalable and cost-effective framework for privacy preservation over big data on cloud

A distributed computing model for big data anonymization in the networks

A Novel Geographic Partitioning System for Anonymizing Health Care Data

Utility-based Anonymization for Privacy Preservation with Less Information Loss

Anonymizing Big Data Streams Using In-memory Processing: A Novel Model Based on One-time Clustering

A Survey of Data Anonymization Techniques for Privacy-Preserving Mining in Bigdata

Research on data privacy protection method with one-to-multiple records

Differentially private data release through multidimensional partitioning