Driving Privacy Forward: Mitigating Information Leakage within Smart Vehicles through Synthetic Data Generation

Krish Parikh
2024-10-11
Abstract: Smart vehicles produce large amounts of data, much of which is sensitive and at risk of privacy breaches. As attackers increasingly exploit anonymised metadata within these datasets to profile drivers, it's important to find solutions that mitigate this information leakage without hindering innovation and ongoing research. Synthetic data has emerged as a promising tool to address these privacy concerns, as it allows for the replication of real-world data relationships while minimising the risk of revealing sensitive information. In this paper, we examine the use of synthetic data to tackle these challenges. We start by proposing a comprehensive taxonomy of 14 in-vehicle sensors, identifying potential attacks and categorising their vulnerability. We then focus on the most vulnerable signals, using the Passive Vehicular Sensor (PVS) dataset to generate synthetic data with a Tabular Variational Autoencoder (TVAE) model, which included over 1 million data points. Finally, we evaluate this against 3 core metrics: fidelity, utility, and privacy. Our results show that we achieved 90.1% statistical similarity and 78% classification accuracy when tested on its original intent while also preventing the profiling of the driver. The code can be found at https://github.com/krish-parikh/Synthetic-Data-Generation
Cryptography and Security, Machine Learning
What problem does this paper attempt to address?
This paper addresses the privacy problems caused by data leakage in smart vehicles. Smart vehicles generate large amounts of data that are widely collected and transmitted, and the sensitive information in that data is at risk of misuse. In particular, drivers can still be profiled through metadata analysis even after anonymization. The paper proposes synthetic data as an alternative that reduces the risk of leaking sensitive information while preserving the data's utility and research value.

### Specific Problem Description

1. **Privacy Leakage Risks**:
   - The data generated by smart vehicles contains sensitive information, such as the driver's behavior patterns and location.
   - Attackers can reconstruct a driver's personal profile by analyzing the anonymized metadata in this data, thereby violating the driver's privacy.
2. **Limitations of Existing Anonymization Technologies**:
   - Traditional anonymization techniques (such as k-anonymity and data masking) often sacrifice the utility and accuracy of the data in order to protect privacy.
   - These techniques cannot completely prevent re-identification attacks, especially when the attacker has auxiliary information.
3. **Needs for Innovation and Research**:
   - Innovation and research still require large amounts of realistic data, even when privacy must be protected.
   - Synthetic data can reproduce the statistical characteristics of real data without exposing sensitive information, and so meets this need.

### Solution

The paper proposes to solve the above problems by generating synthetic data. The specific steps are as follows:

1. **Sensor Classification and Risk Assessment**:
   - A comprehensive sensor taxonomy is proposed that divides 14 kinds of in-vehicle sensor signals into high, medium, and low priorities and evaluates their potential information leakage risks.
   - High-priority sensors (such as GPS and camera data) can directly identify individuals or track behavior and are the key targets for protection.
2. **Synthetic Data Generation**:
   - Deep learning models (such as the Variational Autoencoder (VAE) and the Generative Adversarial Network (GAN)) are used to generate synthetic data.
   - Taking the Passive Vehicular Sensor (PVS) dataset as an example, more than 1 million data points are generated using the Tabular Variational Autoencoder (TVAE) model.
3. **Evaluation Metrics**:
   - The generated synthetic data is evaluated against three core metrics: fidelity, utility, and privacy.
   - The results show that the synthetic data reaches 90.1% statistical similarity and 78% classification accuracy, while preventing the reconstruction of the driver's profile.

### Formula Representation

- **Variational Autoencoder (VAE)**:
  - The encoder maps the input data \(x\) to a distribution \(q(z|x)\) in the latent space, which is usually assumed to be Gaussian:
    \[ q(z|x)=\mathcal{N}(\mu(x),\sigma^{2}(x)) \]
  - The decoder maps the latent variable \(z\) back to the original data space:
    \[ p(x|z)=\text{decoder}(z) \]
  - Training balances two terms, a reconstruction term and a KL divergence term weighted by \(\beta\). As written, the objective is maximized (equivalently, its negative is minimized as the loss):
    \[ \mathcal{L}_{\text{VAE}}=\mathbb{E}_{q(z|x)}[\log p(x|z)]-\beta\cdot\text{KL}(q(z|x)\|p(z)) \]
- **Generative Adversarial Network (GAN)**:
  - The generator \(G\) and the discriminator \(D\) are trained against each other through a shared minimax objective: \[ \min_G\max_D V(D, G)=\mathbb{E}_{x\sim p_{\text{data}}(x)}[\log D(x)]+\mathbb{E}_{z}