Revisiting Confidence Estimation: Towards Reliable Failure Prediction

Fei Zhu, Xu-Yao Zhang, Zhen Cheng, Cheng-Lin Liu
2024-03-05
Abstract: Reliable confidence estimation is a challenging yet fundamental requirement in many risk-sensitive applications. However, modern deep neural networks are often overconfident for their incorrect predictions, i.e., misclassified samples from known classes, and out-of-distribution (OOD) samples from unknown classes. In recent years, many confidence calibration and OOD detection methods have been developed. In this paper, we find a general, widely existing but actually-neglected phenomenon that most confidence estimation methods are harmful for detecting misclassification errors. We investigate this problem and reveal that popular calibration and OOD detection methods often lead to worse confidence separation between correctly classified and misclassified examples, making it difficult to decide whether to trust a prediction or not. Finally, we propose to enlarge the confidence gap by finding flat minima, which yields state-of-the-art failure prediction performance under various settings including balanced, long-tailed, and covariate-shift classification scenarios. Our study not only provides a strong baseline for reliable confidence estimation but also acts as a bridge between understanding calibration, OOD detection, and failure prediction. The code is available at \url{
Computer Vision and Pattern Recognition, Machine Learning
### What problem does this paper attempt to address?

This paper explores how to make deep neural networks (DNNs) more reliable at failure prediction. Specifically, the authors re-examine current confidence estimation methods and study how well they detect misclassified samples.

#### Main problems

1. **Overconfidence**: Modern deep neural networks tend to be overconfident in their wrong predictions, i.e., they assign high confidence to misclassified samples from known classes and to samples from unknown classes. This makes DNNs unreliable in practice, especially in safety-critical applications such as autonomous driving and medical diagnosis.
2. **Limitations of existing methods**:
   - **Confidence calibration**: Many existing calibration methods reduce the model's overconfidence, but they are not effective at detecting misclassified samples.
   - **Out-of-distribution (OOD) detection**: Existing OOD detection methods are mainly designed to distinguish known-class samples from unknown-class samples, and they are of limited help in detecting misclassified samples.
3. **Challenges in failure prediction**: Compared with confidence calibration and OOD detection, failure prediction (i.e., detecting misclassified samples) is a neglected but important problem. It has high practical value because most inputs come from known classes and misclassification errors are common.

#### Research objectives

By re-examining existing confidence estimation methods, the authors identify the limitations of these methods in failure prediction and propose a new method based on flat minima to improve failure prediction performance.
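To make the failure prediction task concrete, the sketch below scores how well a confidence value separates correct predictions from misclassified ones. It assumes the maximum softmax probability as the confidence score and AUROC as the metric. Both are common choices, but this summary does not say which metrics the paper reports. The function names and the toy logits are hypothetical.

```python
import numpy as np


def softmax(logits):
    """Row-wise softmax with max subtraction for numerical stability."""
    z = logits - logits.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)


def failure_prediction_auroc(logits, labels):
    """AUROC for separating correct from misclassified samples.

    Assumption: confidence is the maximum softmax probability.
    Correct predictions are positives, misclassifications are negatives.
    """
    probs = softmax(logits)
    confidence = probs.max(axis=1)
    correct = probs.argmax(axis=1) == labels
    pos, neg = confidence[correct], confidence[~correct]
    if len(pos) == 0 or len(neg) == 0:
        raise ValueError("need at least one correct and one misclassified sample")
    # AUROC equals the probability that a random correct sample gets higher
    # confidence than a random misclassified one (ties count as one half).
    diff = pos[:, None] - neg[None, :]
    return float(((diff > 0).sum() + 0.5 * (diff == 0).sum()) / diff.size)


# Toy example: the last sample is a highly confident error.
logits = np.array([[4.0, 0.0, 0.0],
                   [3.0, 0.0, 0.0],
                   [0.0, 0.5, 0.0],
                   [5.0, 0.0, 0.0]])
labels = np.array([0, 0, 2, 1])
print(failure_prediction_auroc(logits, labels))  # 0.5
```

A single overconfident error is enough to pull the AUROC down to chance level in this toy case, which is the kind of poor confidence separation the summary describes.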
Specific contributions include:

- **Re-evaluating existing methods**: Experiments show that many popular confidence calibration and OOD detection methods are actually harmful to failure prediction.
- **Theoretical analysis**: The paper analyzes how calibration and OOD detection methods behave in failure prediction from the perspectives of proper scoring rules and Bayes-optimal reject rules.
- **Reliable overfitting phenomenon**: The paper reveals a "reliable overfitting" phenomenon, in which failure prediction performance easily overfits during training.
- **Proposing a new method**: The paper proposes a method based on flat minima, which significantly improves failure prediction performance and achieves state-of-the-art results in multiple classification scenarios.

#### Conclusions

This paper emphasizes the importance of re-evaluating existing confidence estimation methods, especially their performance in failure prediction. By proposing a new method, the authors not only improve failure prediction performance but also offer new ideas for developing more reliable and trustworthy machine learning systems.
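The summary does not say how flat minima are found. One widely used heuristic is to average the weights visited late in training, in the style of stochastic weight averaging. The sketch below illustrates that idea on a toy logistic regression; it is an assumption for illustration, not the authors' implementation. The function name, hyperparameters, and synthetic data are all hypothetical.

```python
import numpy as np


def sgd_with_weight_averaging(X, y, epochs=50, avg_start=25, lr=0.1,
                              batch_size=16, seed=0):
    """Mini-batch SGD for logistic regression with late-phase weight averaging.

    Returns (last_weights, averaged_weights). Averaging starts at epoch
    `avg_start` and keeps a running mean of the end-of-epoch weights.
    """
    if not 0 <= avg_start < epochs:
        raise ValueError("avg_start must satisfy 0 <= avg_start < epochs")
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    w_avg = np.zeros(d)
    n_avg = 0
    for epoch in range(epochs):
        order = rng.permutation(n)
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            p = 1.0 / (1.0 + np.exp(-(X[idx] @ w)))   # predicted P(y = 1)
            grad = X[idx].T @ (p - y[idx]) / len(idx)  # cross-entropy gradient
            w -= lr * grad
        if epoch >= avg_start:
            n_avg += 1
            w_avg += (w - w_avg) / n_avg               # running mean update
    return w, w_avg


# Synthetic, linearly separable data (hypothetical example).
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(float)
w_last, w_avg = sgd_with_weight_averaging(X, y)
accuracy = np.mean(((X @ w_avg) > 0) == (y == 1))
print(w_last, w_avg, accuracy)
```

On a convex toy problem like this one the averaged and final weights behave similarly. The benefit of averaging is usually argued for non-convex deep networks, where the averaged solution tends to sit in a wider region of the loss surface.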