Data Deletion for Linear Regression with Noisy SGD

Zhangjie Xia, Chi-Hua Wang, Guang Cheng
2024-10-12
Abstract: In the current era of big data and machine learning, it is essential to find ways to shrink the training dataset while preserving training performance, in order to improve efficiency. The challenge lies in providing practical ways to find points that can be deleted without significantly harming the training result or causing problems such as underfitting. We therefore present the perfect deleted point problem for 1-step noisy SGD in the classical linear regression task, which aims to find the perfect deleted point in the training dataset such that the model resulting from the deleted dataset is identical to the one trained without deleting it. We apply the so-called signal-to-noise ratio and suggest that its value is closely related to the selection of the perfect deleted point. We also implement an algorithm based on this and empirically show its effectiveness on a synthetic dataset. Finally, we analyze the consequences of the perfect deleted point, specifically how it affects training performance and the privacy budget, thereby highlighting its potential. This research underscores the importance of data deletion and calls for more studies in this field.
Machine Learning
What problem does this paper attempt to address?
This paper asks how to reduce the size of the training dataset, and so improve efficiency, without significantly compromising training performance in the era of big data and machine learning. Specifically, it focuses on the classical linear regression task trained with 1-step noisy stochastic gradient descent (SGD) and looks for data points whose deletion leaves the resulting model the same as when they are kept.

The main contributions of the paper include:

1. **Proposing a hypothesis testing method**: Used to address the data deletion problem in linear regression trained with noisy stochastic gradient descent.
2. **Minimizing impact on training performance**: Demonstrating that deleting perfect deletion points has the least impact on training performance compared to other candidate deletion points.
3. **Minimizing privacy issues**: Showing that deleting perfect deletion points may raise fewer privacy issues than deleting other points.
4. **Experimental validation of effectiveness**: Demonstrating the effectiveness and potential of perfect deletion points through experiments on synthetic datasets.

### Background and Motivation

- **Importance of data deletion**: While more training data can generally improve training performance, it also brings issues such as low data quality, increased energy consumption, and overfitting. Additionally, many countries and regions have enacted "right to be forgotten" regulations requiring companies to delete users' personal data.
- **Current research status**: Existing data deletion methods mostly rely on a single metric (such as accuracy), leading to unreliable and inconsistent results. This paper proposes a new method that combines multiple metrics (such as model weight distribution, training loss, and privacy budget) to verify the selection of perfect deletion points.
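
To make the problem statement above concrete, one common way to write a 1-step noisy SGD update for least-squares regression is given below. The notation ($w_0$, $\eta$, $\sigma$, $D$, $z$) is assumed here for illustration and may differ from the paper's.

$$
w_1(D) = w_0 - \eta \, \nabla_w \hat{L}(w_0; D) + \xi, \qquad \xi \sim \mathcal{N}(0, \sigma^2 I), \qquad \hat{L}(w; D) = \frac{1}{|D|} \sum_{(x, y) \in D} \left( x^\top w - y \right)^2 .
$$

Under this notation, a point $z \in D$ would count as a perfect deleted point if $w_1(D \setminus \{z\})$ and $w_1(D)$ have the same distribution. Because both outputs carry the same Gaussian noise, this comes down to the two noise-free updates coinciding.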

### Methods and Theory

- **Perfect deletion point problem**: Defines a perfect deletion point as a point whose deletion leaves the final model the same as when the point is kept.
- **Hypothesis testing method**: Finds perfect deletion points through hypothesis testing, using the signal-to-noise ratio (SNR) to evaluate the effect of deleting each data point.
- **Algorithm implementation**: Designs an algorithm that finds perfect deletion points by calculating the SNR of each data point and selecting the point with the smallest absolute member advantage value.

### Experiments and Results

- **Experimental setup**: Generated a 2-dimensional synthetic dataset containing 200 samples and conducted 100 iterations of 1-step noisy SGD experiments.
- **Results analysis**: Compared the distribution of model weights under different conditions and verified that perfect deletion points are indeed optimal in maintaining the model weight distribution. The effect of perfect deletion points becomes more apparent as the number of iterations increases.

### Conclusion and Future Work

- **Conclusion**: The method proposed in this paper can effectively find perfect deletion points, which do not significantly affect training performance and can minimize privacy issues.
- **Future work**: Studying performance on different types of datasets, exploring further improvements and optimizations, and examining the trade-off between the α value and the mean more closely.

Overall, this paper proposes an innovative method in the field of data deletion and demonstrates its effectiveness and potential through theory and experiments.
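
The procedure summarized under Methods and Theory, together with the setup under Experiments and Results, can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' algorithm: the SNR is taken to be the shift in the noise-free 1-step update caused by deleting a point, divided by the noise standard deviation; the advantage is the total variation distance between two equal-covariance Gaussians, used as a stand-in for the summary's "member advantage"; and "100 iterations" is read as 100 repeated runs. All function names and hyperparameters (`lr`, `sigma`, `w_true`) are hypothetical.

```python
import math
import numpy as np


def one_step_mean(X, y, w0, lr):
    """Noise-free part of one full-batch gradient step on mean squared error."""
    grad = 2.0 * X.T @ (X @ w0 - y) / len(y)
    return w0 - lr * grad


def noisy_sgd_step(X, y, w0, lr, sigma, rng):
    """One gradient step plus isotropic Gaussian noise with std sigma."""
    return one_step_mean(X, y, w0, lr) + sigma * rng.standard_normal(w0.shape)


def snr_per_point(X, y, w0, lr, sigma):
    """Assumed SNR: ||mean update with all points - mean update without point i|| / sigma."""
    full = one_step_mean(X, y, w0, lr)
    snr = np.empty(len(y))
    for i in range(len(y)):
        keep = np.arange(len(y)) != i
        loo = one_step_mean(X[keep], y[keep], w0, lr)
        snr[i] = np.linalg.norm(full - loo) / sigma
    return snr


def advantage_from_snr(snr):
    """TV distance between N(m1, s^2 I) and N(m2, s^2 I): 2*Phi(d/2) - 1 = erf(d / (2*sqrt(2)))."""
    return np.array([math.erf(d / (2.0 * math.sqrt(2.0))) for d in snr])


# Synthetic setup loosely following the summary: 2 dimensions, 200 samples.
rng = np.random.default_rng(0)
n, d = 200, 2
X = rng.standard_normal((n, d))
w_true = np.array([1.0, -2.0])          # hypothetical ground-truth weights
y = X @ w_true + 0.1 * rng.standard_normal(n)
w0 = np.zeros(d)
lr, sigma = 0.1, 0.05                   # hypothetical step size and noise level

snr = snr_per_point(X, y, w0, lr, sigma)
adv = advantage_from_snr(snr)
best = int(np.argmin(np.abs(adv)))      # candidate "perfect deleted point"

# 100 repeated 1-step runs with and without the selected point.
keep = np.arange(n) != best
runs_full = np.stack([noisy_sgd_step(X, y, w0, lr, sigma, rng) for _ in range(100)])
runs_del = np.stack([noisy_sgd_step(X[keep], y[keep], w0, lr, sigma, rng) for _ in range(100)])

print("selected index:", best, "SNR:", snr[best], "advantage:", adv[best])
print("mean weights (full):   ", runs_full.mean(axis=0))
print("mean weights (deleted):", runs_del.mean(axis=0))
```

Comparing `runs_full` and `runs_del` (for example, by plotting the two weight clouds) gives a rough check of whether deleting the selected point leaves the weight distribution nearly unchanged. Repeating the comparison for a high-SNR point shows the contrast.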