Mean Estimation with User-Level Privacy for Spatio-Temporal IoT Datasets

V. Arvind Rameshwar, Anshoo Tandon, Prajjwal Gupta, Aditya Vikram Singh, Novoneel Chakraborty, Abhay Sharma
2024-04-25
Abstract: This paper considers the problem of the private release of sample means of speed values from traffic datasets. Our key contribution is the development of user-level differentially private algorithms that incorporate carefully chosen parameter values to ensure low estimation errors on real-world datasets, while ensuring privacy. We test our algorithms on ITMS (Intelligent Traffic Management System) data from an Indian city, where the speeds of different buses are drawn in a potentially non-i.i.d. manner from an unknown distribution, and where the number of speed samples contributed by different buses is potentially different. We then apply our algorithms to large synthetic datasets, generated based on the ITMS data. Here, we provide theoretical justification for the observed performance trends, and also provide recommendations for the choices of algorithm subroutines that result in low estimation errors. Finally, we characterize the best performance of pseudo-user creation-based algorithms on worst-case datasets via a minimax approach; this then gives rise to a novel procedure for the creation of pseudo-users, which optimizes the worst-case total estimation error. The algorithms discussed in the paper are readily applicable to general spatio-temporal IoT datasets for releasing a differentially private mean of a desired value.
Cryptography and Security, Information Theory, Applications
What problem does this paper attempt to address?
The problem this paper addresses is how to publish accurate sample means from spatio-temporal Internet of Things (IoT) datasets while protecting user privacy. Specifically, it focuses on keeping the estimation error low while ensuring privacy when releasing the sample mean of vehicle speed values from traffic datasets.

### Problem Background

1. **Importance of Privacy Protection**
   - Even the release of seemingly harmless functions of a dataset can allow individual identities to be reconstructed, revealing sensitive information. For example, the taxi data released by the New York City Taxi and Limousine Commission was successfully deanonymized, revealing sensitive information about drivers.
   - The Differential Privacy (DP) framework aims to protect individual data samples or users by adding noise. Standard DP techniques, however, perform poorly when each user contributes many samples: protecting all of a user's samples requires a large amount of noise, which leads to a large estimation error.
2. **Requirement for User-level Privacy**
   - In real-world datasets, such as traffic databases, each user may contribute multiple data samples. Directly applying standard DP techniques then leads to large estimation errors.
   - User-level privacy requires protecting all the samples contributed by each user, not just a single sample.

### Research Objectives

The main objective of this paper is to develop user-level differentially private algorithms that, when releasing the sample mean of vehicle speed values from traffic datasets, protect user privacy while maintaining a low estimation error. Specifically:

- **Data Set**: The research uses data from the Intelligent Traffic Management System (ITMS) of a city in India, where the speed samples of different buses may be drawn from an unknown distribution in a non-independent and identically distributed (non-i.i.d.) manner, and the number of speed samples contributed by different buses may differ.
- **Algorithm Design**: The paper proposes several algorithms, including BASELINE, ARRAY-AVERAGING, LEVY, and QUANTILE. These algorithms reduce the amount of added noise, and hence the estimation error, through techniques such as creating pseudo-users and truncating both the number of samples and the range of speed values.

### Main Contributions

1. **User-level Differential Privacy Algorithms**: New user-level differentially private algorithms are developed that achieve a low estimation error on real-world datasets.
2. **Theoretical Analysis and Experimental Verification**: Theoretical justification is provided for the observed performance trends, and the effectiveness of the different algorithms is verified through extensive experiments.
3. **Optimization of Pseudo-user Creation**: Through a minimax analysis, a new procedure for creating pseudo-users is proposed that optimizes the total estimation error in the worst case.

In conclusion, this paper addresses the problem of how to release the sample mean of vehicle speed values from spatio-temporal IoT datasets with low error while protecting user privacy.
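
To make the idea of truncating sample counts and speed ranges more concrete, the following is a minimal Python sketch of a user-level private mean using the Laplace mechanism. It is not the paper's BASELINE or ARRAY-AVERAGING algorithm; the function name `private_mean`, its parameters (`upper`, `max_per_user`), and the example data are hypothetical. The sketch assumes that each user's sample count is treated as public and fixed, so that neighboring datasets differ only in the values contributed by one user, and that speeds are clipped to `[0, upper]`.

```python
import numpy as np


def private_mean(samples_per_user, upper, epsilon, max_per_user=None, rng=None):
    """Return (noisy_mean, sensitivity) for a user-level DP mean estimate.

    Assumption (not from the source): sample counts per user are public,
    so one user can only change the values of the samples they contribute.
    """
    rng = np.random.default_rng() if rng is None else rng

    kept = []
    for values in samples_per_user:
        # Clip each value to [0, upper] to bound a single sample's influence.
        v = np.clip(np.asarray(values, dtype=float), 0.0, upper)
        # Optionally keep only the first max_per_user samples of each user.
        if max_per_user is not None:
            v = v[:max_per_user]
        kept.append(v)

    counts = np.array([len(v) for v in kept])
    total = counts.sum()
    if total == 0:
        raise ValueError("no samples left after truncation")

    clipped_mean = sum(v.sum() for v in kept) / total

    # One user can shift the sum by at most upper * (largest kept count),
    # so the mean moves by at most this much between neighboring datasets.
    sensitivity = upper * counts.max() / total

    # Laplace mechanism: noise with scale sensitivity / epsilon.
    noise = rng.laplace(loc=0.0, scale=sensitivity / epsilon)
    return clipped_mean + noise, sensitivity


# Hypothetical example: three buses with unequal numbers of speed samples.
speeds = [[10, 20, 30, 40], [50], [60, 70]]
estimate, sens = private_mean(speeds, upper=65, epsilon=1.0, max_per_user=2,
                              rng=np.random.default_rng(0))
print(estimate, sens)
```

Lowering `max_per_user` or `upper` shrinks the sensitivity, and with it the noise, but discards or distorts more of the data. This is the trade-off between noise error and truncation error that motivates careful parameter choices.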