Abstract: We introduce a model of online algorithms subject to strict constraints on data retention. An online learning algorithm encounters a stream of data points, one per round, generated by some stationary process. Crucially, each data point can request that it be removed from memory $m$ rounds after it arrives. To model the impact of removal, we do not allow the algorithm to store any information or calculations between rounds other than a subset of the data points (subject to the retention constraints). At the conclusion of the stream, the algorithm answers a statistical query about the full dataset. We ask: what level of performance can be guaranteed as a function of $m$?

What problem does this paper attempt to address?

The paper asks how to design online algorithms that still perform well under strict data retention limits. Specifically, each data point can request removal from memory $m$ rounds after it arrives, and the algorithm must still answer a statistical query about the full dataset accurately. The paper studies this framework through multi-dimensional mean estimation and linear regression, and characterizes the accuracy achievable as a function of $m$.

### Specific Problem Description

1. **Data Retention Limitations**: When processing the data stream, the algorithm must respect a strict retention period: each data point can request removal $m$ rounds after its arrival. Between rounds, the algorithm may not store any other information or intermediate calculations; it can carry forward only a subset of the data points (subject to the retention limits).
2. **Statistical Query Performance**: After the data stream ends, the algorithm must answer a statistical query about the entire dataset. The paper studies what accuracy an algorithm can guarantee under these retention limits.

### Main Contributions

- **Exponential Improvement**: For multi-dimensional mean estimation and linear regression, the paper proposes an algorithm that achieves the same mean-squared error $\epsilon$ as the optimal algorithm (which retains all data) while retaining each data point for only $m = \text{Poly}(d,\log(1 / \epsilon))$ rounds. The required retention period therefore grows only polylogarithmically in $1 / \epsilon$.
- **Theoretical Analysis**: The paper provides nearly matching lower bounds on the retention period $m$ required to guarantee error $\epsilon$.
- **Practical Significance**: The results suggest that even in non-adversarial settings, a company that simply optimizes algorithm performance may inadvertently leak information about data that should have been deleted. Data retention laws alone may therefore not fully guarantee the "right to be forgotten".

### Summary of Mathematical Formulas

- **Mean Estimation**:
  - For $d$-dimensional data points, if $m\geq\Theta(d\log(d / \epsilon))$, then for any query time $T > Cd / \epsilon$, the expected squared error is at most $\epsilon$:
    \[ \mathbb{E}[\| \theta - \hat{\theta} \|_2^2]\leq\epsilon \]
  - Here $\theta$ is the true mean and $\hat{\theta}$ is the estimated mean.
- **Linear Regression**:
  - If $m = \Theta(d^2\log(d)\log(d / \epsilon))$, then the $\ell_2$ risk is guaranteed to be at most $\epsilon$:
    \[ \mathbb{E}[|\hat{y} - Q_x(F)|^2]\leq\epsilon \]
  - Here $\hat{y}$ is the predicted value and $Q_x(F)=\langle \theta, x \rangle$ is the true value.

### Conclusion

By introducing a limited data retention model, the paper studies how well online algorithms can perform when data points must be deleted on request. It shows that, for mean estimation and linear regression, a retention period polynomial in $d$ and $\log(1 / \epsilon)$ suffices to reach error $\epsilon$. These results highlight the importance of accounting for retention limits when designing algorithms, and they point to possible shortcomings of existing data protection laws.
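
For comparison with the retention bounds summarized above, the following is a minimal sketch of the most naive algorithm the retention model permits: keep only the $m$ most recent data points and report their average at query time. This is not the paper's algorithm; the function name `sliding_window_mean` and the list-of-floats representation are assumptions made for illustration. If the points are i.i.d. with covariance $\Sigma$ (an added assumption), this estimator has expected squared error $\operatorname{tr}(\Sigma)/m$, so for bounded per-coordinate variance it needs $m$ on the order of $d/\epsilon$ to reach error $\epsilon$, compared with the $\text{Poly}(d,\log(1/\epsilon))$ retention period stated above.

```python
from collections import deque


def sliding_window_mean(stream, m):
    """Naive baseline (illustrative, not the paper's method).

    The only state carried between rounds is a subset of the data points:
    the m most recent ones. A point is dropped once m newer points have
    arrived, so no point is kept longer than the retention period.
    """
    if m < 1:
        raise ValueError("retention period m must be at least 1")

    # deque(maxlen=m) discards the oldest point automatically on append.
    retained = deque(maxlen=m)
    for point in stream:
        retained.append(list(point))

    if not retained:
        raise ValueError("stream is empty")

    # Query time: average the retained points coordinate by coordinate.
    d = len(retained[0])
    n = len(retained)
    return [sum(p[j] for p in retained) / n for j in range(d)]


# Usage: with m = 2, only the last two points contribute to the estimate.
estimate = sliding_window_mean([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]], m=2)
print(estimate)  # [4.0, 5.0]
```

The sketch ignores everything that arrived more than $m$ rounds before the query, which is why its error does not improve as the stream grows longer.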

Online Algorithms with Limited Data Retention

Learning-augmented Algorithms for Online Subset Sum

Online Learning: Sufficient Statistics and the Burkholder Method

Optimal Data Selection: An Online Distributed View

A rehearsal framework for computational efficiency in online continual learning

Algorithms for Efficient, Compact Online Data Stream Curation

Iterative Forgetting: Online Data Stream Regression Using Database-Inspired Adaptive Granulation

Learning-augmented Online Minimization of Age of Information and Transmission Costs

Non-Asymptotic Performance of Social Machine Learning Under Limited Data

Quasar 3C298: a test-case for meteoritic nanodiamond 3.5 microns emission

Tight Bounds for Online Balanced Partitioning in the Generalized Learning Model

The Interplay Between Stability and Regret in Online Learning

Nonstationary Nonparametric Online Learning: Balancing Dynamic Regret and Model Parsimony

Online Fair Allocation of Perishable Resources

Efficient Methods for Non-stationary Online Learning

No-Regret Caching via Online Mirror Descent

Overcoming Brittleness in Pareto-Optimal Learning-Augmented Algorithms

On the Complexity of Algorithms with Predictions for Dynamic Graph Problems

Online Learning with Bounded Recall

Towards An Online Incremental Approach to Predict Students Performance

Online Learning From Incomplete and Imbalanced Data Streams