The Sample Complexity Of ERMs In Stochastic Convex Optimization

Daniel Carmon, Roi Livni, Amir Yehudayoff
2023-11-10
Abstract: Stochastic convex optimization is one of the most well-studied models for learning in modern machine learning. Nevertheless, a central fundamental question in this setup remained unresolved: "How many data points must be observed so that any empirical risk minimizer (ERM) shows good performance on the true population?" This question was proposed by Feldman (2016), who proved that $\Omega(\frac{d}{\epsilon}+\frac{1}{\epsilon^2})$ data points are necessary (where $d$ is the dimension and $\epsilon>0$ is the accuracy parameter). Proving an $\omega(\frac{d}{\epsilon}+\frac{1}{\epsilon^2})$ lower bound was left as an open problem. In this work we show that in fact $\tilde{O}(\frac{d}{\epsilon}+\frac{1}{\epsilon^2})$ data points are also sufficient. This settles the question and yields a new separation between ERMs and uniform convergence. This sample complexity holds for the classical setup of learning bounded convex Lipschitz functions over the Euclidean unit ball. We further generalize the result and show that a similar upper bound holds for all symmetric convex bodies. The general bound is composed of two terms: (i) a term of the form $\tilde{O}(\frac{d}{\epsilon})$ with an inverse-linear dependence on the accuracy parameter, and (ii) a term that depends on the statistical complexity of the class of $\textit{linear}$ functions (captured by the Rademacher complexity). The proof builds a mechanism for controlling the behavior of stochastic convex optimization problems.
Machine Learning
What problem does this paper attempt to address?
This paper asks how many data points must be observed in Stochastic Convex Optimization (SCO) so that any Empirical Risk Minimizer (ERM) performs well on the true population. Specifically, the authors study the worst-case sample complexity of ERMs in SCO.

### Background and Problem Description

Stochastic Convex Optimization is a benchmark framework widely used to study stochastic optimization algorithms (such as gradient descent and its variants) in modern machine learning. Although the SCO model is widely used in learning, a central question had remained unresolved: how many data points are required to ensure that any ERM performs well on the true population? The question was posed by Feldman (2016), who proved that at least \(\Omega\left(\frac{d}{\epsilon}+\frac{1}{\epsilon^{2}}\right)\) data points are necessary (where \(d\) is the dimension and \(\epsilon>0\) is the accuracy parameter). Proving a stronger lower bound of \(\omega\left(\frac{d}{\epsilon}+\frac{1}{\epsilon^{2}}\right)\) was left as an open problem.

### Main Contributions of the Paper

The authors prove that \(\tilde{O}\left(\frac{d}{\epsilon}+\frac{1}{\epsilon^{2}}\right)\) data points are in fact also sufficient. This matches Feldman's lower bound up to logarithmic factors, so it settles the question by showing that no stronger lower bound is possible, and it yields a new separation between ERMs and uniform convergence. Specifically:

1. **Sample Complexity**: For the classical setting of learning bounded convex Lipschitz functions over the Euclidean unit ball, the sample complexity is \(\tilde{O}\left(\frac{d}{\epsilon}+\frac{1}{\epsilon^{2}}\right)\).
2. **Generalization Results**: The authors further generalize this result, showing that a similar upper bound holds for all symmetric convex bodies. The generalized upper bound consists of two terms:
   - A term of the form \(\tilde{O}\left(\frac{d}{\epsilon}\right)\) with an inverse-linear dependence on the accuracy parameter.
   - A term that depends on the statistical complexity of the class of linear functions (captured by the Rademacher complexity).

### Methods and Techniques

The authors prove these results by building a mechanism for controlling the behavior of stochastic convex optimization problems. Key techniques include:

- **First-Order Optimality Conditions**: Use the first-order optimality conditions of the stochastic convex optimization problem.
- **Bregman Divergence**: Prove concentration results by combining the non-negativity and boundedness of the Bregman divergence with the Bernstein inequality.
- **Covering Numbers**: Use standard covering number bounds to handle the sample complexity problem in high-dimensional spaces.

### Conclusions

This paper settles the worst-case sample complexity of ERMs in stochastic convex optimization and exhibits a separation between ERMs and uniform convergence. The result bears on the relationship between optimization and generalization, especially in high-dimensional settings.
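
As a rough numerical illustration of the rate \(\frac{d}{\epsilon}+\frac{1}{\epsilon^{2}}\) stated in the abstract, the sketch below evaluates the two terms and reports which one dominates. It is not taken from the paper: the function names `erm_sample_rate` and `dominant_term` are hypothetical, and the constants and logarithmic factors hidden by \(\tilde{O}\) are dropped, so the numbers indicate scaling only and are not actual sample sizes.

```python
def erm_sample_rate(d: int, eps: float) -> float:
    """Leading-order rate d/eps + 1/eps^2.

    Assumption: constants and log factors are ignored, so this is a
    scaling illustration, not a usable sample-size formula.
    """
    if d < 1:
        raise ValueError("dimension d must be a positive integer")
    if not (0.0 < eps <= 1.0):
        raise ValueError("accuracy eps must lie in (0, 1]")
    return d / eps + 1.0 / eps**2


def dominant_term(d: int, eps: float) -> str:
    """Return which of the two terms is larger for the given d and eps.

    d/eps exceeds 1/eps^2 exactly when d > 1/eps.
    """
    dimension_term = d / eps
    statistical_term = 1.0 / eps**2
    return "d/eps" if dimension_term > statistical_term else "1/eps^2"


if __name__ == "__main__":
    # Hypothetical (d, eps) pairs chosen only to show both regimes.
    for d, eps in [(100, 0.1), (2, 0.01), (8, 0.5)]:
        print(d, eps, erm_sample_rate(d, eps), dominant_term(d, eps))
```

The comparison shows that the dimension-dependent term \(\frac{d}{\epsilon}\) governs the rate once \(d > \frac{1}{\epsilon}\), while for low dimension or very small \(\epsilon\) the dimension-free term \(\frac{1}{\epsilon^{2}}\) takes over.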