Abstract: Stochastic convex optimization is one of the most well-studied models for learning in modern machine learning. Nevertheless, a central fundamental question in this setup remained unresolved: "How many data points must be observed so that any empirical risk minimizer (ERM) shows good performance on the true population?" This question was proposed by Feldman (2016), who proved that $\Omega(\frac{d}{\epsilon}+\frac{1}{\epsilon^2})$ data points are necessary (where $d$ is the dimension and $\epsilon>0$ is the accuracy parameter). Proving an $\omega(\frac{d}{\epsilon}+\frac{1}{\epsilon^2})$ lower bound was left as an open problem. In this work we show that in fact $\tilde{O}(\frac{d}{\epsilon}+\frac{1}{\epsilon^2})$ data points are also sufficient. This settles the question and yields a new separation between ERMs and uniform convergence. This sample complexity holds for the classical setup of learning bounded convex Lipschitz functions over the Euclidean unit ball. We further generalize the result and show that a similar upper bound holds for all symmetric convex bodies. The general bound is composed of two terms: (i) a term of the form $\tilde{O}(\frac{d}{\epsilon})$ with an inverse-linear dependence on the accuracy parameter, and (ii) a term that depends on the statistical complexity of the class of $\textit{linear}$ functions (captured by the Rademacher complexity). The proof builds a mechanism for controlling the behavior of stochastic convex optimization problems.

What problem does this paper attempt to address?

The paper asks how many data points must be observed in Stochastic Convex Optimization (SCO) so that any Empirical Risk Minimizer (ERM) performs well on the true population. Specifically, the authors study the worst-case sample complexity of ERMs in SCO.

### Background and Problem Description

Stochastic Convex Optimization is a benchmark framework widely used to study stochastic optimization algorithms (such as gradient descent and its variants) in modern machine learning. Although the SCO model is widely used in learning, a central question had remained unresolved: how many data points are required to ensure that any ERM performs well on the true population? The question was posed by Feldman (2016), who proved that at least $\Omega\left(\frac{d}{\epsilon}+\frac{1}{\epsilon^{2}}\right)$ data points are necessary (where $d$ is the dimension and $\epsilon>0$ is the accuracy parameter). Proving a stronger lower bound of $\omega\left(\frac{d}{\epsilon}+\frac{1}{\epsilon^{2}}\right)$ was left as an open problem.

### Main Contributions of the Paper

The authors prove that $\tilde{O}\left(\frac{d}{\epsilon}+\frac{1}{\epsilon^{2}}\right)$ data points are in fact also sufficient. This settles the open problem, since the upper bound matches Feldman's lower bound up to logarithmic factors, and it yields a new separation between ERMs and uniform convergence. Specifically:

1. **Sample Complexity**: For the classical setting of learning bounded convex Lipschitz functions over the Euclidean unit ball, the sample complexity is $\tilde{O}\left(\frac{d}{\epsilon}+\frac{1}{\epsilon^{2}}\right)$.
2. **Generalization Results**: The authors further generalize this result, showing that a similar upper bound holds for all symmetric convex bodies. The generalized upper bound consists of two terms:
   - A term of the form $\tilde{O}\left(\frac{d}{\epsilon}\right)$ with an inverse-linear dependence on the accuracy parameter.
   - A term that depends on the statistical complexity of the class of linear functions (captured by the Rademacher complexity).

### Methods and Techniques

The authors prove these results by building a mechanism for controlling the behavior of stochastic convex optimization problems. Key techniques include:

- **First-Order Optimality Conditions**: Use the first-order optimality conditions of the stochastic convex optimization problem.
- **Bregman Divergence**: Prove concentration results by combining the non-negativity and boundedness of the Bregman divergence with the Bernstein inequality.
- **Covering Numbers**: Use standard covering number bounds to handle the sample complexity problem in high-dimensional spaces.

### Conclusions

The paper settles the worst-case sample complexity of ERMs in stochastic convex optimization and separates what ERMs achieve from what uniform convergence guarantees. The result helps clarify the relationship between optimization and generalization, especially in high-dimensional models.
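To give a feel for the size of the bound discussed above, here is a minimal Python sketch that evaluates the ERM rate $\frac{d}{\epsilon}+\frac{1}{\epsilon^{2}}$ for a few values of $d$ and $\epsilon$. Constants and logarithmic factors are dropped, so the numbers show scaling only, not actual sample sizes. The comparison rate $\frac{d}{\epsilon^{2}}$, often associated with uniform convergence in this setting, is an assumption added for illustration and is not stated in the text above; the function names are likewise hypothetical.

```python
import math


def erm_rate(d: int, eps: float) -> float:
    """ERM rate d/eps + 1/eps^2 (constants and log factors dropped)."""
    return d / eps + 1.0 / eps**2


def uniform_convergence_rate(d: int, eps: float) -> float:
    """Assumed comparison rate d/eps^2 (not taken from the text above)."""
    return d / eps**2


# Compare the two rates on a few illustrative (dimension, accuracy) pairs.
for d, eps in [(10, 0.1), (1000, 0.1), (1000, 0.01)]:
    erm = erm_rate(d, eps)
    uc = uniform_convergence_rate(d, eps)
    print(f"d={d:5d}  eps={eps:<5}  ERM ~ {erm:.3g}  UC ~ {uc:.3g}  ratio ~ {uc / erm:.3g}")
```

Under these assumptions the ratio between the two rates is $\frac{d}{d\epsilon + 1}$, so the gap widens as $\epsilon$ shrinks and as $d$ grows.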

The Sample Complexity Of ERMs In Stochastic Convex Optimization

Empirical Risk Minimization for Stochastic Convex Optimization: $O(1/n)$- and $O(1/n^2)$-Type of Risk Bounds.

The Sample Complexity of Gradient Descent in Stochastic Convex Optimization

On the Performance of Empirical Risk Minimization with Smoothed Data

Lower and Upper Bounds on the Generalization of Stochastic Exponentially Concave Optimization.

Fast Rates of ERM and Stochastic Approximation: Adaptive to Error Bound Conditions

Improved Sample Complexity for Private Nonsmooth Nonconvex Optimization

The Power of Sampling: Dimension-free Risk Bounds in Private ERM

Complexity of Vector-valued Prediction: From Linear Models to Stochastic Convex Optimization

Sample Complexity of Robust Learning against Evasion Attacks

Query Complexity of Least Absolute Deviation Regression via Robust Uniform Convergence

Towards Minimax Optimality of Model-based Robust Reinforcement Learning

Generalization Error Bounds for Optimization Algorithms Via Stability

High-Probability Complexity Bounds for Non-smooth Stochastic Convex Optimization with Heavy-Tailed Noise

High Probability Complexity Bounds for Non-Smooth Stochastic Optimization with Heavy-Tailed Noise

Stochastic Zeroth-Order Optimization under Strongly Convexity and Lipschitz Hessian: Minimax Sample Complexity

Exponential Tail Local Rademacher Complexity Risk Bounds Without the Bernstein Condition

Universal Rates of Empirical Risk Minimization

Empirical Bayes via ERM and Rademacher complexities: the Poisson model

ERM Learning with Unbounded Sampling

Learning Lipschitz Operators with respect to Gaussian Measures with Near-Optimal Sample Complexity