Data release for machine learning via correlated differential privacy

Hua Shen, Jiqiang Li, Ge Wu, Mingwu Zhang
DOI: https://doi.org/10.1016/j.ipm.2023.103349
2023-03-24
Abstract: Traditional correlated differential privacy techniques usually introduce too much noise, which reduces data availability. In addition, machine learning often works with high-dimensional training sets, which incurs heavy computing overhead. To address the first issue, we design a more reasonable correlation analysis method. It combines feature matching algorithms with information entropy-based feature importance to accurately calculate the correlated degree of records, which reduces data correlation and correlated sensitivity and improves the data's utility. This novel method for evaluating the correlation of records alleviates the limitations of traditional correlation calculation methods. Based on this method, we provide a data release solution that reduces data dimensionality and improves the training efficiency of machine learning by combining the maximum information coefficient with differential privacy. Furthermore, we introduce an optimization algorithm based on mutual information that chooses the best principal components to improve the efficiency of our data release solution. To demonstrate the effectiveness and performance of the proposed solution compared to existing schemes, we conducted experiments on three real-world datasets. The experimental results show that our scheme reduces data correlation by up to 80% compared to existing schemes. Moreover, the accuracy of machine learning improves by 10% to 20% for the same privacy budget.
computer science, information systems, information science & library science
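The abstract's central claim, that a lower correlated degree between records lowers the correlated sensitivity and therefore the noise, can be illustrated with a minimal sketch. This is not the paper's correlation analysis method. It assumes the definition of correlated sensitivity commonly used in the correlated differential privacy literature (the largest row sum of absolute correlated degrees weighted by per-record sensitivities) together with a standard Laplace mechanism. The function names, the two correlated-degree matrices, and the per-record sensitivities are all hypothetical.

```python
import random


def correlated_sensitivity(delta, record_sensitivity):
    """Largest row sum of |delta[i][j]| * record_sensitivity[j].

    delta is a hypothetical correlated-degree matrix (delta[i][i] == 1);
    this formula is an assumed, commonly used definition, not the paper's.
    """
    return max(
        sum(abs(d) * s for d, s in zip(row, record_sensitivity))
        for row in delta
    )


def laplace_release(true_answer, sensitivity, epsilon, rng):
    """Add Laplace noise with scale sensitivity / epsilon.

    The difference of two i.i.d. exponential draws with rate
    epsilon / sensitivity follows a Laplace distribution with that scale.
    """
    rate = epsilon / sensitivity
    noise = rng.expovariate(rate) - rng.expovariate(rate)
    return true_answer + noise


# Hypothetical correlated-degree matrices for three records.
dense = [[1.0, 0.8, 0.6], [0.8, 1.0, 0.7], [0.6, 0.7, 1.0]]
sparse = [[1.0, 0.2, 0.0], [0.2, 1.0, 0.1], [0.0, 0.1, 1.0]]
per_record = [1.0, 1.0, 1.0]  # assumed sensitivity of each record
epsilon = 1.0                 # privacy budget

cs_dense = correlated_sensitivity(dense, per_record)
cs_sparse = correlated_sensitivity(sparse, per_record)

rng = random.Random(0)
noisy_count = laplace_release(10.0, cs_sparse, epsilon, rng)

# The Laplace scale is sensitivity / epsilon, so weaker correlation
# means a smaller scale for the same privacy budget.
print(cs_dense / epsilon, cs_sparse / epsilon, noisy_count)
```

With these illustrative matrices the weakly correlated data yields roughly half the Laplace scale of the strongly correlated data for the same privacy budget, which is the mechanism by which a more accurate correlation estimate can improve utility.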