Abstract: In this work, we consider the notion of "criterion collapse," in which optimization of one metric implies optimality in another, with a particular focus on conditions for collapse into error probability minimizers under a wide variety of learning criteria, ranging from DRO and OCE risks (CVaR, tilted ERM) to non-monotonic criteria underlying recent ascent-descent algorithms explored in the literature (Flooding, SoftAD). We show how collapse in the context of losses with a Bernoulli distribution goes far beyond existing results for CVaR and DRO, then expand our scope to include surrogate losses, showing conditions where monotonic criteria such as tilted ERM cannot avoid collapse, whereas non-monotonic alternatives can.
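
To make the Bernoulli case concrete, here is a minimal sketch that is not taken from the paper. It assumes a zero-one loss that equals 1 with probability `p`, and uses the standard closed forms of CVaR and tilted risk for such a loss. The function names, the candidate error probabilities, and the parameters `beta` and `gamma` are illustrative assumptions. Both criteria are non-decreasing in `p`, so the candidate with the lowest error probability is also optimal under either criterion.

```python
import math

def bernoulli_cvar(p: float, beta: float) -> float:
    """CVaR at level beta of a Bernoulli(p) loss: mean of the worst (1 - beta) fraction."""
    return min(1.0, p / (1.0 - beta))

def bernoulli_tilted(p: float, gamma: float) -> float:
    """Tilted risk (1/gamma) * log E[exp(gamma * L)] for L ~ Bernoulli(p), gamma != 0."""
    return math.log(1.0 - p + p * math.exp(gamma)) / gamma

def argmin(values):
    """Index of the smallest entry in a list."""
    return min(range(len(values)), key=values.__getitem__)

# Hypothetical error probabilities of three candidate classifiers.
error_probs = [0.30, 0.12, 0.21]

best_by_error = argmin(error_probs)
best_by_cvar = argmin([bernoulli_cvar(p, beta=0.5) for p in error_probs])
best_by_tilted = argmin([bernoulli_tilted(p, gamma=2.0) for p in error_probs])

# All three criteria select the same candidate.
print(best_by_error, best_by_cvar, best_by_tilted)
```

Because the loss takes only two values, its whole distribution is determined by `p`, which is why criteria that look very different end up ranking candidates the same way in this setting.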

What problem does this paper attempt to address?

The paper explores a phenomenon that arises when optimizing different performance metrics (referred to as "criteria") in machine learning, known as "criterion collapse," with particular attention to losses that follow a Bernoulli distribution, as with the zero-one loss in classification. Specifically, the authors ask when optimizing one metric implicitly also optimizes another, especially when that second metric is the error probability.

### Main Problems Addressed by the Paper

1. **Criterion Collapse**: The paper first defines and explores the concept of criterion collapse, where optimizing one performance metric implies optimality in a different metric. For example, when the loss has a Bernoulli distribution, optimizing the Conditional Value-at-Risk (CVaR) or Distributionally Robust Optimization (DRO) criteria amounts to minimizing the error probability.
2. **Relation to Surrogate Loss Functions**: The paper then discusses criterion collapse when surrogate loss functions are used for training. Even when surrogate losses (such as log loss or other continuous loss functions) are used, there are conditions under which optimizing certain criteria still leads to error rate minimization.
3. **Methods to Avoid Criterion Collapse**: Finally, the paper shows how undesirable criterion collapse can be avoided by choosing non-monotonic criteria. These criteria give more control over different aspects of the loss distribution and may suit applications with multiple objectives.

### Key Contributions

- **Theoretical Analysis**: The paper proves that under Bernoulli-distributed losses, a wide range of common criteria (including CVaR, DRO, and other expectation-based criteria) collapse to error rate minimization, so optimizing these criteria is in effect optimizing the error rate.
- **Empirical Studies**: The paper also verifies empirically that with surrogate loss functions, monotonic criteria such as tilted ERM cannot avoid collapse into error rate minimization, while non-monotonic criteria can.
- **Non-monotonic Criteria**: The paper highlights non-monotonic criteria, such as those underlying Flooding and SoftAD, as a basis for more flexible learning algorithms when several objectives matter in practice.

In summary, the paper aims to understand criterion collapse when optimizing different performance metrics in machine learning, and shows that undesirable collapse can be avoided through non-monotonic criteria.
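
The contrast between monotonic and non-monotonic criteria on surrogate losses can be sketched as follows. This is an illustration under stated assumptions, not the paper's experiment: the per-example losses are made up, tilted ERM is taken as the log of the mean exponentiated loss scaled by `1/gamma`, and the Flooding-style criterion is taken in its commonly cited form `|mean(loss) - threshold| + threshold`. The function names and the values of `gamma` and `threshold` are hypothetical.

```python
import math

def tilted_erm(losses, gamma):
    """Empirical tilted risk: (1/gamma) * log(mean(exp(gamma * loss)))."""
    return math.log(sum(math.exp(gamma * l) for l in losses) / len(losses)) / gamma

def flooding(losses, threshold):
    """Flooding-style criterion: |mean(loss) - threshold| + threshold."""
    mean = sum(losses) / len(losses)
    return abs(mean - threshold) + threshold

# Hypothetical per-example surrogate losses for two candidate models.
# Candidate A's losses are pointwise no larger than candidate B's.
losses_a = [0.01, 0.02, 0.03, 0.04]
losses_b = [0.08, 0.10, 0.12, 0.10]

# A monotonic criterion always prefers the pointwise-smaller losses.
tilted_prefers_a = tilted_erm(losses_a, gamma=1.0) < tilted_erm(losses_b, gamma=1.0)

# A non-monotonic criterion can prefer the candidate whose mean loss sits near the threshold.
flooding_prefers_b = flooding(losses_b, threshold=0.10) < flooding(losses_a, threshold=0.10)

print(tilted_prefers_a, flooding_prefers_b)
```

In this toy setup the monotonic criterion is pushed toward ever smaller surrogate losses, while the Flooding-style criterion stops rewarding reductions below the threshold, which is the kind of freedom the summary attributes to non-monotonic criteria.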

Criterion Collapse and Loss Distribution Control
