KIPPS: Knowledge infusion in Privacy Preserving Synthetic Data Generation

Anantaa Kotal, Anupam Joshi
2024-09-26
Abstract: The integration of privacy measures, including differential privacy techniques, ensures a provable privacy guarantee for the synthetic data. However, challenges arise for Generative Deep Learning models when tasked with generating realistic data, especially in critical domains such as Cybersecurity and Healthcare. Generative Models optimized for continuous data struggle to model discrete and non-Gaussian features that have domain constraints. Challenges increase when the training datasets are limited and not diverse. In such cases, generative models create synthetic data that repeats sensitive features, which is a privacy risk. Moreover, generative models face difficulties comprehending attribute constraints in specialized domains. This leads to the generation of unrealistic data that impacts downstream accuracy. To address these issues, this paper proposes a novel model, KIPPS, that infuses Domain and Regulatory Knowledge from Knowledge Graphs into Generative Deep Learning models for enhanced Privacy Preserving Synthetic data generation. The novel framework augments the training of generative models with supplementary context about attribute values and enforces domain constraints during training. This added guidance enhances the model's capacity to generate realistic and domain-compliant synthetic data. The proposed model is evaluated on real-world datasets, specifically in the domains of Cybersecurity and Healthcare, where domain constraints and rules add to the complexity of the data. Our experiments evaluate the privacy resilience and downstream accuracy of the model against benchmark methods, demonstrating its effectiveness in addressing the balance between privacy preservation and data accuracy in complex domains.
Machine Learning, Artificial Intelligence, Cryptography and Security
What problem does this paper attempt to address?
The paper addresses the difficulty of generating privacy-preserving synthetic data that is also realistic, in particular tabular data in critical domains such as cybersecurity and healthcare. It focuses on three challenges:

1. **Limited data diversity**: The training set is small, especially for discrete attributes, so the generative model may only replicate the specific values it observed during training and ignore other valid values. For example, in a network activity dataset the available IP addresses are both limited and private. The model therefore cannot consider the wider range of valid IP addresses, which increases the privacy risk.
2. **Complexity of discrete attributes**: When processing tabular data, a generative model treats each attribute as either continuous or discrete. The complexity of a discrete attribute is proportional to the size of its value range. In real-world data the set of discrete values is usually large, and this high input cardinality is difficult for the model to handle. For example, a port number can be any value representable in 2 bytes, so modeling every possible port value would increase the complexity of the learning model by a factor of \(2^{16}\). With multiple such attributes the complexity grows further.
3. **Domain constraints**: In specialized domains such as cybersecurity, attributes are related by rules that are not obvious from the training data alone. For example, in network activity data, specific protocols and events are associated with specific port numbers. A record whose port number does not match its network event is not merely improbable but incorrect. The generative model cannot see this distinction, so it produces unrealistic data that may mislead downstream tasks, such as a classifier that separates legitimate traffic from attack traffic.

To address these challenges, the paper proposes a framework named KIPPS (Knowledge Infusion in Privacy Preserving Synthetic Data Generation). The framework augments the training of Generative Adversarial Networks (GAN) and Conditional Generative Adversarial Networks (cGAN) in two ways:

1. **Adding domain context to training data**:
   - **Replacing attributes with domain attributes**: Use domain knowledge to replace the specific values of sensitive attributes with general attributes, so that the generative model learns patterns over similar attributes instead of replicating specific values.
   - **Grouping by attributes**: Use domain knowledge to group attribute values by their domain properties, which reduces the input dimension and the computational burden on the model.
   - **Conditional rules**: Supplement the training data with domain knowledge so that the model can learn and apply the conditional constraints in the dataset.
2. **Conditional training with domain-rule-enforced loss**:
   - **Conditional training**: Represent each tabular record as model input, encoding discrete attributes as one-hot vectors and appending the conditional rules as vectors of binary flags.
   - **Domain-rule-enforced generator loss**: Train the model with the WGAN loss and gradient penalty, with the generator loss additionally enforcing the domain rules so that generated data complies with them.

With these methods, KIPPS aims to generate synthetic data that meets privacy-protection requirements while still reflecting the characteristics of the original data, making it more useful and safer for data sharing and analysis.
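The two steps above can be illustrated with a minimal Python sketch. It is not the authors' implementation. The lookup tables `PORT_SERVICE` and `SERVICE_PROTOCOLS` are hypothetical stand-ins for knowledge-graph queries, the function names are invented for illustration, and the penalty weight is an arbitrary assumption. Ports are grouped by the standard IANA ranges, a binary flag records whether the port and protocol are consistent, and the loss function adds an assumed rule-violation penalty to the usual WGAN generator term. The gradient penalty is left out because in WGAN-GP it belongs to the critic's loss, and a real training loop would have to compute the rule penalty from differentiable generator outputs rather than from hard 0/1 counts.

```python
from typing import List, Sequence

# Hypothetical stand-ins for knowledge-graph lookups (illustrative only).
PORT_SERVICE = {22: "ssh", 53: "dns", 80: "http", 443: "https"}
SERVICE_PROTOCOLS = {
    "ssh": {"tcp"},
    "dns": {"udp", "tcp"},
    "http": {"tcp"},
    "https": {"tcp"},
}
PORT_GROUPS = ["well_known", "registered", "dynamic"]
PROTOCOLS = ["tcp", "udp"]


def port_group(port: int) -> str:
    """Replace a raw port value with its IANA range (grouping step)."""
    if not 0 <= port <= 65535:
        raise ValueError(f"invalid port: {port}")
    if port <= 1023:
        return "well_known"
    if port <= 49151:
        return "registered"
    return "dynamic"


def rule_satisfied(port: int, protocol: str) -> bool:
    """True unless the port maps to a known service that excludes this protocol."""
    service = PORT_SERVICE.get(port)
    return service is None or protocol in SERVICE_PROTOCOLS[service]


def one_hot(value: str, vocabulary: Sequence[str]) -> List[int]:
    """One-hot encode a discrete value against a fixed vocabulary."""
    return [1 if value == v else 0 for v in vocabulary]


def encode_row(port: int, protocol: str) -> List[int]:
    """One-hot port group, one-hot protocol, then a binary rule flag."""
    flag = 1 if rule_satisfied(port, protocol) else 0
    return one_hot(port_group(port), PORT_GROUPS) + one_hot(protocol, PROTOCOLS) + [flag]


def generator_loss(critic_scores: Sequence[float],
                   violations: Sequence[int],
                   weight: float = 10.0) -> float:
    """WGAN generator term plus an assumed penalty on the rule-violation rate."""
    wgan_term = -sum(critic_scores) / len(critic_scores)
    penalty = sum(violations) / len(violations)
    return wgan_term + weight * penalty


print(encode_row(443, "tcp"))   # [1, 0, 0, 1, 0, 1]
print(encode_row(80, "udp"))    # [1, 0, 0, 0, 1, 0]
print(generator_loss([0.5, 1.5], [0, 1]))  # 4.0
```

Grouping collapses the \(2^{16}\) possible port values into three categories, so each encoded row here has only six entries. The last entry is the rule flag: it is 0 for port 80 over UDP, the kind of mismatch that the penalty term is meant to discourage.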