Universal generalization guarantees for Wasserstein distributionally robust models

Tam Le, Jérôme Malick
2024-05-29
Abstract: Distributionally robust optimization has emerged as an attractive way to train robust machine learning models, capturing data uncertainty and distribution shifts. Recent statistical analyses have proved that robust models built from Wasserstein ambiguity sets have nice generalization guarantees, breaking the curse of dimensionality. However, these results are obtained in specific cases, at the cost of approximations, or under assumptions difficult to verify in practice. In contrast, we establish, in this article, exact generalization guarantees that cover all practical cases, including any transport cost function and any loss function, potentially non-convex and nonsmooth. For instance, our result applies to deep learning, without requiring restrictive assumptions. We achieve this result through a novel proof technique that combines nonsmooth analysis rationale with classical concentration results. Our approach is general enough to extend to the recent versions of Wasserstein/Sinkhorn distributionally robust problems that involve (double) regularizations.
Optimization and Control, Machine Learning
What problem does this paper attempt to address?
This paper sets out to provide exact and general generalization guarantees for machine-learning models, specifically for distributionally robust optimization models based on the Wasserstein distance. The paper aims to:

1. **Establish exact generalization guarantees**: Existing generalization guarantees usually hold only in specific cases, at the cost of approximations, or under assumptions that are difficult to verify. The goal of this paper is to establish exact guarantees that apply to a wide range of practical scenarios, including general transport costs and parameterized loss functions. In particular, the results apply to deep-learning models without restrictive assumptions.
2. **Handle nonsmooth loss functions**: Many existing results do not cover deep-learning models that contain nonsmooth building blocks (such as ReLU activation functions, max-pooling operators, or optimization layers). This paper addresses this gap with tools from variational analysis, which let it handle nonsmooth robust objective functions.
3. **Extend to regularized Wasserstein robust models**: In addition to the standard Wasserstein robust model, this paper considers the Wasserstein robust model with entropy regularization and provides corresponding generalization guarantees. This includes the case of double regularization, in which regularization terms are added to both the constraint and the objective function.

### Main contributions

1. **General generalization guarantees**: This paper provides exact generalization guarantees of the form \(\sup_{Q \in P(\Xi), W_c(\hat{P}_n, Q) \leq \rho} \mathbb{E}_{\xi \sim Q}[f(\xi)] \geq \mathbb{E}_{\xi \sim P}[f(\xi)]\), which apply to many machine-learning scenarios without restrictive assumptions.
2. **Handle nonsmooth loss functions**: Through a new proof technique, this paper handles deep-learning models that contain nonsmooth building blocks, which existing results do not cover.
3. **Generalization guarantees for regularized models**: This paper provides generalization guarantees for the standard Wasserstein robust model and extends them to the Wasserstein robust model with entropy regularization, together with corresponding generalization and excess-risk bounds.

### Technical details

- **Definition of the critical radius \(\rho_{\text{crit}}\)**: \(\rho_{\text{crit}}\) is defined as \(\inf_{f \in F} \mathbb{E}_{\xi \sim P} \left[ \min \{ c(\xi, \zeta) : \zeta \in \arg\max_{\Xi} f \} \right]\). It acts as a degeneracy threshold in the generalization guarantee.
- **Dual lower bound \(\lambda_{\text{low}}\)**: Using variational analysis tools, this paper establishes a lower bound \(\lambda_{\text{low}}\) on the dual variable, which is crucial for handling nonsmooth loss functions.
- **Concentration inequality**: Using a known uniform concentration theorem for Lipschitz functions, this paper derives generalization bounds that hold with high probability.

### Conclusion

This paper provides exact and general generalization guarantees by directly handling the inherent nonsmoothness of the robust problem. The guarantees apply to a wide range of machine-learning scenarios, covering both the standard Wasserstein robust model and the model with entropy regularization. Future research could focus on practical aspects, such as designing efficient methods for choosing \(\rho\), and on broader distributionally robust optimization algorithms.
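The Wasserstein robust risk and the critical radius described in the technical details can be illustrated numerically. The following is a minimal sketch, not the paper's method: it assumes a one-dimensional finite support \(\Xi\), a squared-distance cost \(c\), and a single hand-picked nonsmooth, non-convex loss `loss`. It evaluates the robust risk through the standard dual form \(\inf_{\lambda \geq 0} \lambda \rho + \mathbb{E}_{\xi \sim \hat{P}_n}[\max_{\zeta} f(\zeta) - \lambda c(\xi, \zeta)]\), minimized over a finite grid of \(\lambda\) values, and computes an empirical plug-in version of \(\rho_{\text{crit}}\) for that single loss. All function and variable names are hypothetical.

```python
import numpy as np

def sq_cost(x, z):
    """Pairwise squared-distance cost c(x, z) = |x - z|^2 (an assumed choice)."""
    return (x[:, None] - z[None, :]) ** 2

def robust_risk(samples, support, f, rho, lambdas):
    """Dual value min over the lambda grid of
    lambda * rho + mean_i max_z [f(z) - lambda * c(xi_i, z)]."""
    fz = f(support)                          # loss on the support, shape (m,)
    cost = sq_cost(samples, support)         # shape (n, m)
    penalized = fz[None, None, :] - lambdas[:, None, None] * cost[None, :, :]
    inner = penalized.max(axis=2)            # shape (len(lambdas), n)
    return float((lambdas * rho + inner.mean(axis=1)).min())

def empirical_critical_radius(samples, support, f):
    """Plug-in critical radius for one loss: mean cost from each sample
    to its nearest maximizer of f on the support."""
    fz = f(support)
    maximizers = support[np.isclose(fz, fz.max())]
    return float(sq_cost(samples, maximizers).min(axis=1).mean())

def loss(z):
    """Illustrative nonsmooth, non-convex loss (hypothetical)."""
    return np.abs(np.abs(z) - 1.0)

rng = np.random.default_rng(0)
support = np.linspace(-2.0, 2.0, 41)         # finite stand-in for Xi
samples = rng.choice(support, size=50)       # empirical sample defining P_n
lambdas = np.concatenate(([0.0], np.logspace(-2, 3, 200)))  # dual grid

rho_crit = empirical_critical_radius(samples, support, loss)
print(f"empirical risk: {loss(samples).mean():.4f}, max loss: {loss(support).max():.4f}")
for rho in (0.0, 0.5 * rho_crit, rho_crit, 2.0 * rho_crit):
    value = robust_risk(samples, support, loss, rho, lambdas)
    print(f"rho={rho:.3f}  robust risk={value:.4f}")
```

In this toy setting the robust risk equals the empirical risk at \(\rho = 0\), grows with \(\rho\), and saturates at the maximum of the loss once \(\rho\) reaches the empirical critical radius. That saturation is the degenerate regime that \(\rho_{\text{crit}}\) is meant to flag.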