Abstract: Selective Classification, wherein models can reject low-confidence predictions, promises reliable translation of machine-learning based classification systems to real-world scenarios such as clinical diagnostics. While current evaluation of these systems typically assumes fixed working points based on pre-defined rejection thresholds, methodological progress requires benchmarking the general performance of systems akin to the $\mathrm{AUROC}$ in standard classification. In this work, we define 5 requirements for multi-threshold metrics in selective classification regarding task alignment, interpretability, and flexibility, and show how current approaches fail to meet them. We propose the Area under the Generalized Risk Coverage curve ($\mathrm{AUGRC}$), which meets all requirements and can be directly interpreted as the average risk of undetected failures. We empirically demonstrate the relevance of $\mathrm{AUGRC}$ on a comprehensive benchmark spanning 6 data sets and 13 confidence scoring functions. We find that the proposed metric substantially changes metric rankings on 5 out of the 6 data sets.


### What problems does this paper attempt to solve?

This paper addresses common flaws in the evaluation of Selective Classification (SC) systems. Current evaluation methods typically assume fixed working points based on predefined rejection thresholds, so they cannot measure the overall performance of a system across thresholds. Methodological progress requires a multi-threshold evaluation metric analogous to the AUROC (Area Under the Receiver Operating Characteristic curve) in standard classification.

#### Main problems:

1. **Limitations of existing evaluation methods**: Current selective classification evaluation mainly focuses on fixed working points defined by preset rejection thresholds, which does not capture the overall performance of the system.
2. **Lack of multi-threshold evaluation metrics**: In standard classification, AUROC evaluates classifier performance across thresholds, but selective classification lacks a comparably general metric.
3. **Deficiencies of existing metrics**: Existing multi-threshold metrics (such as AURC) do not meet all of the requirements listed below and, in particular, do not adequately capture the risk of silent (undetected) failures.

#### Solutions:

The authors propose a new evaluation metric, the Area under the Generalized Risk Coverage curve (AUGRC). According to the authors, it meets the following five requirements:

- **Task Alignment**: The metric accounts for both the classification performance and the ranking quality of the Confidence Scoring Function (CSF).
- **Monotonicity**: Improving one of these two factors while keeping the other unchanged should lead to a better metric value.
- **Ranking Interpretability**: The metric should provide an intuitive assessment of ranking quality.
- **CSF Flexibility**: The metric is applicable to any choice of confidence scoring function.
- **Error Flexibility**: The metric is applicable to different error functions (not just 0/1 errors).

By introducing AUGRC, the authors aim to provide a more reliable and interpretable metric for the evaluation of selective classification systems, thereby supporting further methodological development in this field.

### Formula summary:

In the formulas below, $m$ is the classifier, $g$ is the confidence scoring function, $\ell$ is the error function, $\tau$ is the rejection threshold, $N$ is the number of samples, $I[\cdot]$ is the indicator function, and $Y_f = 1$ denotes a failure.

- **Selective Risk**:

\[ \text{Selective Risk}(m,g)(\tau) := \frac{\sum_{i = 1}^{N}\ell(m(x_i),y_i)\cdot I[g(x_i)\geq\tau]}{\sum_{i = 1}^{N}I[g(x_i)\geq\tau]} \]

- **Generalized Risk**:

\[ \text{Generalized Risk}(m,g)(\tau) := \frac{1}{N}\sum_{i = 1}^{N}\ell(m(x_i),y_i)\cdot I[g(x_i)\geq\tau] \]

- **AUGRC**:

\[ \text{AUGRC}=\int_{0}^{1}P(Y_f = 1,g(x)\geq\tau)\,dP(g(x)\geq\tau) \]

Selective risk averages the error over accepted samples only, whereas generalized risk divides by all $N$ samples. AUGRC integrates the generalized risk over coverage, which is why the abstract describes it as the average risk of undetected (silent) failures.
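The formulas above can be illustrated with a small empirical estimate. The following is a minimal sketch, not the authors' implementation. It assumes that thresholds are placed at each observed confidence value (so coverage takes the values 1/N, 2/N, ..., 1), that the integral is approximated by a plain average over these coverage levels, and that ties in confidence are broken by input order. The function names `risk_coverage_curves`, `augrc`, and `aurc` are hypothetical and chosen only for this example.

```python
from itertools import accumulate


def risk_coverage_curves(confidences, losses):
    """Empirical coverage, selective risk, and generalized risk.

    Assumption: one threshold per sample, i.e. the k most confident
    samples are accepted for k = 1..N.
    """
    n = len(losses)
    if n == 0 or n != len(confidences):
        raise ValueError("confidences and losses must be non-empty and equally long")
    # Indices sorted from most to least confident (ties keep input order).
    order = sorted(range(n), key=lambda i: confidences[i], reverse=True)
    # Running sum of the losses of the accepted samples.
    cum_losses = list(accumulate(losses[i] for i in order))
    coverage = [k / n for k in range(1, n + 1)]
    # Selective risk: error averaged over accepted samples only.
    selective = [c / k for k, c in enumerate(cum_losses, start=1)]
    # Generalized risk: same numerator, but divided by all N samples.
    generalized = [c / n for c in cum_losses]
    return coverage, selective, generalized


def augrc(confidences, losses):
    """Average of the generalized risk over the N coverage levels."""
    _, _, generalized = risk_coverage_curves(confidences, losses)
    return sum(generalized) / len(generalized)


def aurc(confidences, losses):
    """Average of the selective risk over the N coverage levels."""
    _, selective, _ = risk_coverage_curves(confidences, losses)
    return sum(selective) / len(selective)


# Toy data (made up for illustration): one failure among four samples.
conf = [0.9, 0.8, 0.7, 0.6]
misranked = [0, 0, 1, 0]   # the failure is ranked above a correct sample
well_ranked = [0, 0, 0, 1]  # the failure has the lowest confidence

print(augrc(conf, misranked))    # 0.125
print(augrc(conf, well_ranked))  # 0.0625
```

In this toy example both systems have the same accuracy, and the lower value for `well_ranked` reflects only the better confidence ranking: the failure is the first sample to be rejected, so it contributes to the generalized risk only at full coverage.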

Overcoming Common Flaws in the Evaluation of Selective Classification Systems
