Abstract: In the current era of big data and machine learning, it is essential to find ways to shrink the training dataset while preserving training performance, in order to improve efficiency. The challenge lies in providing practical ways to find points that can be deleted without significantly harming the training result or causing problems such as underfitting. We therefore present the perfect deleted point problem for 1-step noisy SGD in the classical linear regression task, which aims to find the perfect deleted point in the training dataset such that the model resulting from the reduced dataset is identical to the one trained without deleting it. We apply the signal-to-noise ratio and suggest that its value is closely related to the selection of the perfect deleted point. We also implement an algorithm based on this and empirically show its effectiveness on a synthetic dataset. Finally, we analyze the consequences of the perfect deleted point, specifically how it affects training performance and the privacy budget, thereby highlighting its potential. This research underscores the importance of data deletion and the urgent need for more studies in this field.

What problem does this paper attempt to address?

This paper addresses how to reduce the size of the training dataset, and so improve efficiency in the era of big data and machine learning, without significantly compromising training performance. Specifically, it focuses on the classic linear regression task trained with 1-step noisy stochastic gradient descent (SGD), and looks for data points whose deletion leaves the resulting model the same as when they are kept.

The main contributions of the paper include:

1. **Proposing a hypothesis testing method**: Used to address the data deletion problem in linear regression with noisy stochastic gradient descent.
2. **Minimizing impact on training performance**: Demonstrating that deleting perfect deletion points has the least impact on training performance compared to other potential deletion points.
3. **Minimizing privacy issues**: Showing that deleting perfect deletion points may raise fewer privacy issues than deleting other points.
4. **Experimental validation of effectiveness**: Demonstrating the effectiveness and potential of perfect deletion points through experiments on synthetic datasets.

### Background and Motivation

- **Importance of data deletion**: While more training data can generally improve training performance, it also brings issues such as low data quality, increased energy consumption, and overfitting. Additionally, many countries and regions have enacted "right to be forgotten" regulations requiring companies to delete users' personal data.
- **Current research status**: Existing data deletion methods mostly rely on a single metric (such as accuracy), leading to unreliable and inconsistent results. This paper proposes a new method that combines multiple metrics (such as model weight distribution, training loss, and privacy budget) to verify the selection of perfect deletion points.

### Methods and Theory

- **Perfect deletion point problem**: Defines a perfect deletion point as a point whose deletion leaves the final model the same as when the point is kept.
- **Hypothesis testing method**: Finds perfect deletion points through hypothesis testing, using the signal-to-noise ratio (SNR) to evaluate the effect of deleting each data point.
- **Algorithm implementation**: Designs an algorithm that finds perfect deletion points by calculating the SNR of each data point and selecting the point with the smallest absolute member advantage value.

### Experiments and Results

- **Experimental setup**: Generated a 2-dimensional synthetic dataset containing 200 samples and ran 100 iterations of the 1-step noisy SGD experiment.
- **Results analysis**: Compared the distribution of model weights under different conditions and verified that perfect deletion points are indeed optimal in maintaining the model weight distribution. The effect of perfect deletion points becomes more apparent as the number of iterations increases.

### Conclusion and Future Work

- **Conclusion**: The proposed method can effectively find perfect deletion points, which do not significantly affect training performance and can minimize privacy issues.
- **Future work**: Studying performance on different types of datasets, exploring further improvements and optimizations, and examining more closely the trade-off between the α value and the mean.

Overall, this paper proposes an innovative method in the field of data deletion and demonstrates its effectiveness and potential through theory and experiments.
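To make the setting under "Methods and Theory" and "Experiments and Results" more concrete, here is a minimal Python sketch of scoring deletion candidates for 1-step noisy SGD on a 2-dimensional synthetic dataset with 200 samples. It is not the paper's algorithm: the summary does not give the exact SNR or advantage formulas, so the sketch assumes isotropic Gaussian noise added to the full-batch gradient, defines the SNR as the shift in the mean gradient caused by deleting a point divided by the noise standard deviation, and uses the total variation distance between two equal-covariance Gaussians as a stand-in for the advantage. The function names `deletion_snr` and `advantage`, and the values of `sigma`, `w0`, and `w_true`, are hypothetical.

```python
import math
import numpy as np

def deletion_snr(X, y, w0, sigma):
    """Assumed SNR per point: norm of the mean-gradient shift caused by
    deleting that point, divided by the gradient-noise std `sigma`."""
    n = len(y)
    residual = X @ w0 - y                      # shape (n,)
    per_sample = residual[:, None] * X         # gradient of 0.5*(x.w - y)^2, shape (n, d)
    full_mean = per_sample.mean(axis=0)        # mean gradient on the full dataset
    # Leave-one-out mean gradient for every candidate point at once.
    loo_mean = (per_sample.sum(axis=0) - per_sample) / (n - 1)
    shift = np.linalg.norm(loo_mean - full_mean, axis=1)
    return shift / sigma

def advantage(snr):
    """Total variation distance between N(m0, s^2 I) and N(m1, s^2 I)
    with |m0 - m1| / s = snr, i.e. 2*Phi(snr/2) - 1 (assumed advantage proxy)."""
    return np.array([math.erf(s / (2.0 * math.sqrt(2.0))) for s in snr])

# Synthetic data: 200 samples in 2 dimensions (all constants are illustrative).
rng = np.random.default_rng(0)
n, d = 200, 2
X = rng.normal(size=(n, d))
w_true = np.array([1.0, -2.0])
y = X @ w_true + 0.1 * rng.normal(size=n)
w0 = np.zeros(d)      # initial weights before the single SGD step
sigma = 1.0           # std of the Gaussian noise added to the gradient

snr = deletion_snr(X, y, w0, sigma)
adv = advantage(snr)
best = int(np.argmin(np.abs(adv)))  # candidate with the smallest absolute advantage
print(f"candidate index: {best}, SNR: {snr[best]:.6f}, advantage: {adv[best]:.6f}")
```

Under these assumptions the learning rate cancels out of the SNR, because it scales the mean shift and the injected noise equally. The selected candidate is the point whose per-sample gradient lies closest to the mean gradient, so removing it changes the distribution of the updated weights the least.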

Data Deletion for Linear Regression with Noisy SGD

Approximate Data Deletion from Machine Learning Models

How to Prevent the Continuous Damage of Noises to Model Training?

The More Data, the Better? Demystifying Deletion-Based Methods in Linear Regression with Missing Data

Delete My Account: Impact of Data Deletion on Machine Learning Classifiers

DeRDaVa: Deletion-Robust Data Valuation for Machine Learning

Noisy Truncated SGD: Optimization and Generalization

Adaptive Machine Unlearning

Impact of Noisy Labels on Sound Event Detection: Deletion Errors Are More Detrimental Than Insertion Errors

Differentially private stochastic gradient descent with low-noise

Certified Machine Unlearning via Noisy Stochastic Gradient Descent

On the Privacy of Noisy Stochastic Gradient Descent for Convex Optimization

Noisy Early Stopping for Noisy Labels

Dataset Distillers Are Good Label Denoisers In the Wild

SlimML: Removing Non-Critical Input Data in Large-Scale Iterative Machine Learning

Privacy Loss of Noisy Stochastic Gradient Descent Might Converge Even for Non-Convex Losses

Differential Privacy of Noisy (S)GD under Heavy-Tailed Perturbations

RQP-SGD: Differential Private Machine Learning through Noisy SGD and Randomized Quantization

Certified Data Removal from Machine Learning Models

Find Important Training Dataset by Observing the Training Sequence Similarity

Double Descent and Overfitting under Noisy Inputs and Distribution Shift for Linear Denoisers