Abstract: Reliable confidence estimation is a challenging yet fundamental requirement in many risk-sensitive applications. However, modern deep neural networks are often overconfident for their incorrect predictions, i.e., misclassified samples from known classes, and out-of-distribution (OOD) samples from unknown classes. In recent years, many confidence calibration and OOD detection methods have been developed. In this paper, we find a general, widely existing but actually-neglected phenomenon that most confidence estimation methods are harmful for detecting misclassification errors. We investigate this problem and reveal that popular calibration and OOD detection methods often lead to worse confidence separation between correctly classified and misclassified examples, making it difficult to decide whether to trust a prediction or not. Finally, we propose to enlarge the confidence gap by finding flat minima, which yields state-of-the-art failure prediction performance under various settings including balanced, long-tailed, and covariate-shift classification scenarios. Our study not only provides a strong baseline for reliable confidence estimation but also acts as a bridge between understanding calibration, OOD detection, and failure prediction. The code is available at \url{
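
The "confidence separation between correctly classified and misclassified examples" mentioned in the abstract can be made concrete with a small sketch. The code below assumes that confidence is the maximum softmax probability and that separation is measured by AUROC (the probability that a correct prediction receives higher confidence than a misclassified one). Both choices are common in this area but are assumptions here, since the abstract does not name a score or a metric. The function names are illustrative only.

```python
import math


def max_softmax_confidence(logits):
    """Return (predicted class index, max softmax probability) for one sample."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    pred = max(range(len(probs)), key=probs.__getitem__)
    return pred, probs[pred]


def failure_prediction_auroc(all_logits, labels):
    """AUROC of confidence for separating correct from misclassified samples.

    Computed as the fraction of (correct, misclassified) pairs in which the
    correct sample has the higher confidence; ties count as one half.
    """
    correct_conf, wrong_conf = [], []
    for logits, label in zip(all_logits, labels):
        pred, conf = max_softmax_confidence(logits)
        (correct_conf if pred == label else wrong_conf).append(conf)
    if not correct_conf or not wrong_conf:
        raise ValueError("need at least one correct and one misclassified sample")
    wins = 0.0
    for c in correct_conf:
        for w in wrong_conf:
            if c > w:
                wins += 1.0
            elif c == w:
                wins += 0.5
    return wins / (len(correct_conf) * len(wrong_conf))


if __name__ == "__main__":
    # Toy two-class logits (made-up values, for illustration only).
    toy_logits = [[4.0, 0.0], [3.0, 0.0], [0.0, 1.0], [2.0, 0.0]]
    toy_labels = [0, 0, 1, 1]
    print(failure_prediction_auroc(toy_logits, toy_labels))
```

An AUROC near 1 means misclassified samples receive systematically lower confidence, so a threshold on confidence can decide whether to trust a prediction. A value near 0.5 means confidence carries little information about correctness.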

### What problems does this paper attempt to solve?

This paper explores how to make deep neural networks (DNNs) more reliable at failure prediction. Specifically, the authors re-examine current confidence estimation methods and study how well they detect misclassified samples.

#### Main problems

1. **Overconfidence**: Modern deep neural networks tend to be overconfident in wrong predictions (i.e., misclassified samples and samples from unknown classes), assigning them excessively high confidence. This makes DNNs unreliable in practice, especially in safety-critical applications such as autonomous driving and medical diagnosis.
2. **Limitations of existing methods**:
   - **Confidence calibration**: Many existing calibration methods can reduce the model's overconfidence, but they are not effective at detecting misclassified samples.
   - **Out-of-distribution (OOD) detection**: Existing OOD detection methods are mainly designed to distinguish samples of known classes from samples of unknown classes, and they are of limited help in detecting misclassified samples.
3. **Challenges in failure prediction**: Compared with confidence calibration and OOD detection, failure prediction (i.e., detecting misclassified samples) is a neglected but important problem. It has high practical value because most inputs come from known classes and misclassification errors are widespread.

#### Research objectives

By re-examining existing confidence estimation methods, the authors identify the limitations of these methods in failure prediction and propose a new method based on flat minima to improve failure prediction performance.

Specific contributions include:

- **Re-evaluating existing methods**: Experiments show that many popular confidence calibration and OOD detection methods are actually harmful to failure prediction.
- **Theoretical analysis**: The paper analyzes the behavior of calibration and OOD detection methods in failure prediction from the perspectives of proper scoring rules and Bayes-optimal rejection rules.
- **Reliable overfitting phenomenon**: The paper reveals a reliable overfitting phenomenon: failure prediction performance is prone to overfitting during training.
- **Proposing a new method**: The paper proposes a method based on flat minima that significantly improves failure prediction performance and achieves state-of-the-art results in multiple classification scenarios.

#### Conclusions

This paper emphasizes the importance of re-evaluating existing confidence estimation methods, especially their performance in failure prediction. By proposing a new method, the authors not only improve failure prediction performance but also offer new ideas for developing more reliable and trustworthy machine learning systems.
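
The summary refers to Bayes-optimal rejection rules without stating one. As a hedged illustration, the sketch below implements the classical rule usually attributed to Chow: under 0-1 loss with a fixed rejection cost `reject_cost`, predict the most probable class when its posterior is at least `1 - reject_cost`, and reject otherwise. Whether the paper uses exactly this formulation is an assumption. The function and parameter names are hypothetical, and in practice the true posterior is unknown and is replaced by the model's confidence.

```python
def chow_decision(posterior, reject_cost):
    """Return the predicted class index, or None to reject.

    posterior:   list of class probabilities for one input (assumed to sum to 1)
    reject_cost: cost of abstaining, between 0 and 1 (a wrong prediction costs 1)
    """
    if not 0.0 <= reject_cost <= 1.0:
        raise ValueError("reject_cost must lie in [0, 1]")
    best = max(range(len(posterior)), key=posterior.__getitem__)
    # Expected loss of predicting `best` is 1 - posterior[best];
    # expected loss of rejecting is reject_cost. Pick the cheaper option.
    if 1.0 - posterior[best] <= reject_cost:
        return best
    return None


if __name__ == "__main__":
    # Made-up posteriors, for illustration only.
    print(chow_decision([0.9, 0.1], reject_cost=0.2))  # confident input
    print(chow_decision([0.6, 0.4], reject_cost=0.2))  # ambiguous input
```

The rule depends only on how well the top-class confidence ranks correct predictions above incorrect ones, which is why failure prediction quality matters for deciding when to abstain.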

Revisiting Confidence Estimation: Towards Reliable Failure Prediction
