Overcoming Common Flaws in the Evaluation of Selective Classification Systems

Jeremias Traub, Till J. Bungert, Carsten T. Lüth, Michael Baumgartner, Klaus H. Maier-Hein, Lena Maier-Hein, Paul F. Jaeger
2024-10-19
Abstract: Selective Classification, wherein models can reject low-confidence predictions, promises reliable translation of machine-learning based classification systems to real-world scenarios such as clinical diagnostics. While current evaluation of these systems typically assumes fixed working points based on pre-defined rejection thresholds, methodological progress requires benchmarking the general performance of systems akin to the $\mathrm{AUROC}$ in standard classification. In this work, we define 5 requirements for multi-threshold metrics in selective classification regarding task alignment, interpretability, and flexibility, and show how current approaches fail to meet them. We propose the Area under the Generalized Risk Coverage curve ($\mathrm{AUGRC}$), which meets all requirements and can be directly interpreted as the average risk of undetected failures. We empirically demonstrate the relevance of $\mathrm{AUGRC}$ on a comprehensive benchmark spanning 6 data sets and 13 confidence scoring functions. We find that the proposed metric substantially changes metric rankings on 5 out of the 6 data sets.
Machine Learning, Computer Vision and Pattern Recognition, Methodology
What problem does this paper attempt to address?
This paper addresses common flaws in the evaluation of Selective Classification (SC) systems. Current evaluation typically assumes fixed working points based on pre-defined rejection thresholds, which does not capture a system's general performance across thresholds. Methodological progress instead requires a multi-threshold metric akin to the AUROC (Area Under the Receiver Operating Characteristic curve) in standard classification.

#### Main problems:

1. **Limitations of existing evaluation methods**: Current selective classification evaluation focuses on fixed working points defined by preset rejection thresholds, so it cannot measure the overall performance of a system.
2. **Lack of multi-threshold evaluation metrics**: In standard classification, AUROC evaluates a classifier across all thresholds, but selective classification lacks a comparably comprehensive metric.
3. **Deficiencies of existing metrics**: Existing multi-threshold metrics (such as AURC, the Area Under the Risk Coverage curve) do not meet all of the requirements listed below and, in particular, do not adequately reflect the risk of silent (undetected) failures.

#### Solutions:

The authors propose a new evaluation metric, the Area under the Generalized Risk Coverage curve (AUGRC), which meets the following five requirements:

- **Task Alignment**: The metric accounts for both the classification performance and the ranking quality of the Confidence Scoring Function (CSF).
- **Monotonicity**: Improving either of these two factors while keeping the other unchanged leads to a better metric value.
- **Ranking Interpretability**: The metric provides an intuitive assessment of ranking quality.
- **CSF Flexibility**: The metric is applicable to any choice of confidence scoring function.
- **Error Flexibility**: The metric is applicable to different error functions, not just 0/1 errors.

With AUGRC, the authors aim to provide a more reliable and interpretable metric for the evaluation of selective classification systems.

### Formula summary:

Here $m$ is the classifier, $g$ the confidence scoring function, $\tau$ the rejection threshold, $\ell$ the error function, $I[\cdot]$ the indicator function, $N$ the number of samples, and $Y_f = 1$ denotes a failure.

- **Selective Risk**:
\[ \text{Selective Risk}(m,g)(\tau) := \frac{\sum_{i = 1}^{N}\ell(m(x_i),y_i)\cdot I[g(x_i)\geq\tau]}{\sum_{i = 1}^{N}I[g(x_i)\geq\tau]} \]
- **Generalized Risk**:
\[ \text{Generalized Risk}(m,g)(\tau) := \frac{1}{N}\sum_{i = 1}^{N}\ell(m(x_i),y_i)\cdot I[g(x_i)\geq\tau] \]
- **AUGRC**:
\[ \text{AUGRC}=\int_{0}^{1}P(Y_f = 1,g(x)\geq\tau)\,dP(g(x)\geq\tau) \]

The two risks differ only in their normalization: the selective risk divides by the number of accepted samples, whereas the generalized risk divides by the total number of samples $N$. This is why AUGRC can be interpreted directly as the average risk of undetected (silent) failures.
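
To make the formula summary concrete, the following is a minimal Python sketch that evaluates both risk definitions at every threshold and approximates the areas under the resulting curves. It is an illustration under stated assumptions, not the authors' reference implementation: the function names `risk_coverage_curves` and `area_under_curve`, the toy data, the tie-breaking rule, and the use of a simple Riemann sum for the integral are all choices made here for the example.

```python
import numpy as np


def risk_coverage_curves(confidence, loss):
    """Return coverage, selective risk, and generalized risk at every threshold.

    Samples are accepted from most to least confident, so entry k-1 of each
    array corresponds to the threshold that accepts the k most confident samples.
    """
    confidence = np.asarray(confidence, dtype=float)
    loss = np.asarray(loss, dtype=float)
    n = confidence.size

    # Sort by descending confidence; ties keep their input order (an assumption).
    order = np.argsort(-confidence, kind="stable")
    cumulative_loss = np.cumsum(loss[order])
    accepted = np.arange(1, n + 1)

    coverage = accepted / n
    selective_risk = cumulative_loss / accepted  # normalized by accepted samples
    generalized_risk = cumulative_loss / n       # normalized by all N samples
    return coverage, selective_risk, generalized_risk


def area_under_curve(risk):
    # Right Riemann sum over coverage: each threshold adds a step of width 1/N.
    return float(np.mean(risk))


# Hypothetical toy data: four samples, the two failures have the lowest confidence.
confidence = [0.9, 0.8, 0.7, 0.6]
loss = [0, 0, 1, 1]  # 0/1 error, 1 = failure

coverage, sel, gen = risk_coverage_curves(confidence, loss)
aurc = area_under_curve(sel)   # area under the selective risk curve
augrc = area_under_curve(gen)  # area under the generalized risk curve
```

Both curves coincide at full coverage, where they equal the overall error rate. In this sketch, reversing the confidence ranking while keeping the same failures increases the AUGRC estimate, since failures are then accepted at low coverage.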