Large-Scale Non-convex Stochastic Constrained Distributionally Robust Optimization

Qi Zhang, Yi Zhou, Ashley Prater-Bennette, Lixin Shen, Shaofeng Zou
2024-04-01
Abstract: Distributionally robust optimization (DRO) is a powerful framework for training models that are robust to data distribution shifts. This paper focuses on constrained DRO, which has an explicit characterization of the robustness level. Existing studies on constrained DRO mostly focus on convex loss functions and exclude the practical and challenging case of non-convex loss functions, e.g., neural networks. This paper develops a stochastic algorithm and its performance analysis for non-convex constrained DRO. The computational complexity of our stochastic algorithm at each iteration is independent of the overall dataset size, and the algorithm is thus suitable for large-scale applications. We focus on uncertainty sets defined by the general Cressie-Read family of divergences, which includes the $\chi^2$-divergence as a special case. We prove that our algorithm finds an $\epsilon$-stationary point with a computational complexity of $\mathcal O(\epsilon^{-3k_*-5})$, where $k_*$ is the parameter of the Cressie-Read divergence. The numerical results indicate that our method outperforms existing methods. Our method also applies to the smoothed conditional value at risk (CVaR) DRO.
Machine Learning
What problem does this paper attempt to address?
This paper addresses large-scale non-convex stochastic constrained distributionally robust optimization (DRO). In machine learning, traditional empirical risk minimization can suffer from performance degradation when the training and testing data distributions do not match. The DRO framework trains models that are robust to such distribution changes by minimizing the expected loss under the worst-case distribution within an uncertainty set.

The paper specifically targets constrained DRO with a non-convex loss function, such as a neural network, a setting that has been less explored in previous research. The authors propose a new stochastic algorithm whose per-iteration computational complexity is independent of the overall dataset size, making it suitable for large-scale applications. They focus on uncertainty sets based on the Cressie-Read family of divergences, which includes the $\chi^2$-divergence as a special case, and also cover the smoothed conditional value at risk (CVaR) DRO problem.

The challenges faced in the paper include:
1. In large-scale applications, computing the full gradient directly is not feasible because of the large number of training samples, so gradients must be estimated efficiently from a small number of samples.
2. The non-convex loss function makes it difficult to generalize existing methods.
3. The Lagrangian dual form of constrained DRO is neither smooth nor Lipschitz, which makes convergence analysis difficult.

The main contributions of the paper are:
1. Designing a new stochastic algorithm, based on biased estimation, for the non-convex constrained DRO problem, with per-iteration computational complexity independent of the training data size.
2. Proposing a Frank-Wolfe update method for the Lagrange multipliers to control the gap between the objective function and its optimal value.
3. Showing that the algorithm finds an $\epsilon$-stationary point of the non-convex constrained DRO problem and outperforms existing methods in numerical experiments.

Through these contributions, the paper provides effective tools for large-scale non-convex constrained distributionally robust optimization and improves the robustness of models to changes in data distribution.
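To make the Cressie-Read uncertainty set and the mini-batch issue more concrete, here is a minimal Python sketch. It is not the paper's algorithm. It assumes the usual Cressie-Read generator $\phi_k(t) = (t^k - kt + k - 1)/(k(k-1))$ for $k > 1$, the conjugate exponent $k_* = k/(k-1)$, and a standard dual expression for the worst-case loss over a divergence ball of radius $\rho$, which may differ from the Lagrangian formulation analyzed in the paper. The function names, the radius `rho`, the dual variable `eta`, and the random losses are all hypothetical and chosen only for illustration.

```python
import random


def cressie_read_phi(t, k):
    """Assumed generator: phi_k(t) = (t^k - k*t + k - 1) / (k*(k - 1)), k > 1."""
    return (t ** k - k * t + k - 1.0) / (k * (k - 1.0))


def cressie_read_divergence(q, p, k):
    """D(Q || P) = sum_i p_i * phi_k(q_i / p_i) for discrete Q, P with p_i > 0."""
    return sum(pi * cressie_read_phi(qi / pi, k) for qi, pi in zip(q, p))


def dual_objective(losses, eta, rho, k):
    """Assumed dual form, evaluated on a batch of per-sample losses:
    c_k(rho) * (mean((loss - eta)_+^{k_*}))^{1/k_*} + eta.
    """
    k_star = k / (k - 1.0)
    c_k = (1.0 + k * (k - 1.0) * rho) ** (1.0 / k)
    moment = sum(max(l - eta, 0.0) ** k_star for l in losses) / len(losses)
    return c_k * moment ** (1.0 / k_star) + eta


random.seed(0)
# Hypothetical per-sample losses standing in for a large training set.
all_losses = [random.random() for _ in range(10_000)]
# A small mini-batch: its cost does not depend on the dataset size.
batch = random.sample(all_losses, 64)

full_value = dual_objective(all_losses, eta=0.2, rho=0.1, k=2.0)
batch_value = dual_objective(batch, eta=0.2, rho=0.1, k=2.0)
print(full_value, batch_value)
```

With `k = 2` the divergence reduces to one half of the $\chi^2$-divergence, which is the special case mentioned above. Because the batch average sits inside the outer power $1/k_*$, the mini-batch value is in general a biased estimate of the full-data value, which illustrates why an analysis that tolerates biased estimates is needed in this setting.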