Differentially Private Simple Linear Regression

Daniel Alabi, Audra McMillan, Jayshree Sarathy, Adam Smith, Salil Vadhan
DOI: https://doi.org/10.48550/arXiv.2007.05157
2020-07-10
Abstract: Economics and social science research often require analyzing datasets of sensitive personal information at fine granularity, with models fit to small subsets of the data. Unfortunately, such fine-grained analysis can easily reveal sensitive individual information. We study algorithms for simple linear regression that satisfy differential privacy, a constraint which guarantees that an algorithm's output reveals little about any individual input data record, even to an attacker with arbitrary side information about the dataset. We consider the design of differentially private algorithms for simple linear regression for small datasets, with tens to hundreds of datapoints, which is a particularly challenging regime for differential privacy. Focusing on a particular application to small-area analysis in economics research, we study the performance of a spectrum of algorithms we adapt to the setting. We identify key factors that affect their performance, showing through a range of experiments that algorithms based on robust estimators (in particular, the Theil-Sen estimator) perform well on the smallest datasets, but that other more standard algorithms do better as the dataset size increases.
Machine Learning, Cryptography and Security, Methodology
What problem does this paper attempt to address?
### Problems Addressed by the Paper

This paper explores how to perform simple linear regression while protecting individual privacy. Specifically, it focuses on designing differentially private (DP) algorithms for simple linear regression on small datasets that contain sensitive information.

#### Main Issues

1. **Privacy Protection on Small Datasets**: Economics and social science research often requires fine-grained analysis of datasets containing sensitive personal information, and such analysis can reveal information about individuals. The key issue is therefore to keep statistical estimates valid while protecting privacy.
2. **Simple Linear Regression under Differential Privacy**: The paper investigates how to design simple linear regression algorithms that satisfy differential privacy, so that the algorithm's output reveals little about any individual input record.
3. **Challenges of Small Datasets**: With only tens to hundreds of data points, designing effective differentially private algorithms is particularly challenging.

#### Specific Goals

- Provide differentially private algorithms for simple linear regression on small datasets in which the noise added for privacy does not substantially increase the uncertainty of the estimates.
- Evaluate the performance of different algorithms under various parameter settings through experiments, and identify the methods best suited to practical applications.
- Pay special attention to the "Opportunity Atlas" tool in economics, which is used to study the relationship between the environments in which children grow up and their economic mobility. Because the datasets involved are usually small (100 to 400 data points), effective differentially private algorithms are needed to protect this data.
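
For reference, the differential privacy constraint referred to above is usually formalized as follows. This is the standard textbook definition of $\varepsilon$-differential privacy; the symbols $M$, $D$, $D'$, $S$, and $\varepsilon$ are notation introduced here for illustration and do not appear in the summary itself. A randomized algorithm $M$ is $\varepsilon$-differentially private if, for every pair of datasets $D$ and $D'$ that differ in a single record and every set $S$ of possible outputs,

$$
\Pr[M(D) \in S] \le e^{\varepsilon} \, \Pr[M(D') \in S].
$$

A smaller $\varepsilon$ gives a stronger guarantee but generally requires more noise, which is why small datasets are a difficult regime.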

#### Methods and Results

- Several differentially private algorithms based on robust estimators (such as the Theil-Sen estimator) were studied and compared with other, more standard methods.
- Algorithms based on the Theil-Sen estimator performed best on the smallest datasets, but other standard algorithms performed better as the dataset size increased.
- A series of experiments showed that, for a wide range of real-world datasets and moderate privacy parameter values, a differentially private linear regression algorithm can be found whose error is smaller than the standard error.

### Conclusion

Among the algorithms studied, the paper highlights DPExpTheilSen, a differentially private algorithm based on the Theil-Sen estimator, which performs well across a range of scenarios, particularly on small datasets. The paper also discusses which algorithms suit which dataset properties, providing useful guidance for further research.
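
To make the robust-estimator idea above more concrete, here is a minimal Python sketch of the Theil-Sen estimator (the median of pairwise slopes) alongside one possible private variant. It is an illustration under stated assumptions, not the paper's exact DPExpTheilSen procedure: the function names are hypothetical, the slope bounds `lo` and `hi` are assumed to be public, each point is used in a single random pair, ties in `x` are mapped to a slope of 0 by convention, and the private median uses a standard exponential-mechanism construction. It has not been audited as a production privacy implementation.

```python
import math
import random
import statistics


def theil_sen_slope(xs, ys):
    """Non-private Theil-Sen estimate: median of all pairwise slopes."""
    slopes = [
        (ys[j] - ys[i]) / (xs[j] - xs[i])
        for i in range(len(xs))
        for j in range(i + 1, len(xs))
        if xs[j] != xs[i]
    ]
    return statistics.median(slopes)


def dp_median(values, lo, hi, epsilon, rng):
    """Exponential-mechanism median of values clipped to [lo, hi] (lo < hi)."""
    z = sorted(min(max(v, lo), hi) for v in values)
    edges = [lo] + z + [hi]
    m = len(z)
    # Interval i = [edges[i], edges[i + 1]] has exactly i data points below it.
    # Utility is minus the rank distance to the median; changing one value
    # shifts any rank by at most 1, so the utility has sensitivity 1.
    utilities = [-abs(i - m / 2) for i in range(m + 1)]
    best = max(utilities)  # subtracted for numerical stability only
    weights = [
        (edges[i + 1] - edges[i]) * math.exp(epsilon * (utilities[i] - best) / 2)
        for i in range(m + 1)
    ]
    i = rng.choices(range(m + 1), weights=weights)[0]
    return rng.uniform(edges[i], edges[i + 1])


def dp_theil_sen_slope(xs, ys, lo, hi, epsilon, rng):
    """Sketch of a private slope estimate from one random matching of points."""
    idx = list(range(len(xs)))
    rng.shuffle(idx)  # the matching is chosen independently of the data values
    slopes = []
    for a, b in zip(idx[0::2], idx[1::2]):
        dx = xs[b] - xs[a]
        # Assumed convention: a tie in x yields slope 0.0, so the number of
        # slopes never depends on the data.
        slopes.append((ys[b] - ys[a]) / dx if dx != 0 else 0.0)
    return dp_median(slopes, lo, hi, epsilon, rng)


if __name__ == "__main__":
    demo_rng = random.Random(1)
    demo_xs = [i / 100 for i in range(200)]
    demo_ys = [2 * x + demo_rng.gauss(0, 0.1) for x in demo_xs]
    print("non-private:", theil_sen_slope(demo_xs, demo_ys))
    print("private:    ", dp_theil_sen_slope(demo_xs, demo_ys, -10, 10, 1.0, demo_rng))
```

Using disjoint pairs means that changing one record alters at most one slope, which keeps the sensitivity of the median's utility at 1. The price is that far fewer slopes are used than in the non-private estimator.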