Towards Understanding Why Label Smoothing Degrades Selective Classification and How to Fix It

Guoxuan Xia, Olivier Laurent, Gianni Franchi, Christos-Savvas Bouganis
2024-10-11
Abstract: Label smoothing (LS) is a popular regularisation method for training neural networks as it is effective in improving test accuracy and is simple to implement. Hard one-hot labels are smoothed by uniformly distributing probability mass to other classes, reducing overfitting. Prior work has suggested that in some cases LS can degrade selective classification (SC) -- where the aim is to reject misclassifications using a model's uncertainty. In this work, we first demonstrate empirically across an extended range of large-scale tasks and architectures that LS consistently degrades SC. We then address a gap in existing knowledge, providing an explanation for this behaviour by analysing logit-level gradients: LS degrades the uncertainty rank ordering of correct vs incorrect predictions by regularising the max logit more when a prediction is likely to be correct, and less when it is likely to be wrong. This elucidates previously reported experimental results where strong classifiers underperform in SC. We then demonstrate the empirical effectiveness of post-hoc logit normalisation for recovering lost SC performance caused by LS. Furthermore, linking back to our gradient analysis, we again provide an explanation for why such normalisation is effective.
Machine Learning, Artificial Intelligence, Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper examines how label smoothing (LS) affects selective classification (SC) and explains why LS degrades SC performance. Specifically:

1. **Problem background**:
   - Label smoothing (LS) is a commonly used regularization technique for improving the test accuracy of neural networks. It reduces overfitting by mixing "hard" one-hot labels with a uniform distribution.
   - In selective classification (SC), the model must not only classify but also decide, based on its predictive uncertainty, whether to reject a prediction, so that fewer misclassifications are accepted.
2. **Deficiencies in existing research**:
   - Although LS improves classification accuracy, previous studies have suggested that in some cases it can harm SC performance. The reasons for this had not been fully explained.
3. **Research objectives**:
   - **Verify the impact of LS on SC**: Test experimentally whether LS consistently reduces SC performance across a variety of large-scale tasks and architectures.
   - **Explain why LS affects SC**: Analyze the gradient of the LS loss function to explain why LS reduces SC performance.
   - **Propose solutions**: Explore how post-hoc methods such as logit normalization can recover the SC performance lost to LS.
4. **Key contributions**:
   - **Empirical verification**: LS is shown to consistently reduce SC performance across multiple large-scale tasks and architectures, and the degradation becomes more pronounced as the LS strength increases.
   - **Theoretical explanation**: A logit-level gradient analysis shows that LS regularizes the max logit more when a prediction is likely to be correct and less when it is likely to be wrong. This degrades the uncertainty rank ordering of correct versus incorrect predictions, which harms SC performance.
   - **Solutions**: Post-hoc logit normalization is shown empirically to recover the SC performance lost to LS, and its effectiveness is explained through the same gradient analysis.
5. **Formula summary** (here $N$ is the number of training samples, $K$ the number of classes, $\alpha$ the smoothing strength, $\delta$ the Kronecker delta, and $v$ the logit vector):
   - **Cross-entropy loss (CE)**:
     \[ L_{\text{CE}}(\theta) = -\frac{1}{N} \sum_{n=1}^{N} \sum_{k=1}^{K} \delta_{y^{(n)},\omega_k} \log P(\omega_k \mid x^{(n)}; \theta) \]
   - **Label smoothing loss (LS)**:
     \[ L_{\text{LS}}(\theta; \alpha) = -\frac{1}{N} \sum_{n=1}^{N} \sum_{k=1}^{K} \left[ (1 - \alpha)\, \delta_{y^{(n)},\omega_k} + \frac{\alpha}{K} \right] \log P(\omega_k \mid x^{(n)}; \theta) \]
   - **Uncertainty estimate (U) after logit normalization**:
     \[ U = -v'_{\max} = -\max_k v'_k, \quad v' = \frac{v}{\|v\|_p} \]

Together, these results offer a new perspective on how LS affects SC and practical methods for improving SC performance.
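The formulas in the summary can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' implementation: all function names are hypothetical, the default `p=2` is an assumption, and the gradient expression is the standard softmax cross-entropy identity (softmax output minus target distribution) applied to the smoothed targets, written per sample without the 1/N factor.

```python
import math


def log_softmax(v):
    """Numerically stable log-softmax of a list of logits."""
    m = max(v)
    log_z = m + math.log(sum(math.exp(x - m) for x in v))
    return [x - log_z for x in v]


def smoothed_targets(y, num_classes, alpha):
    """Target distribution (1 - alpha) * one_hot(y) + alpha / K."""
    return [(1.0 - alpha) * (1.0 if k == y else 0.0) + alpha / num_classes
            for k in range(num_classes)]


def ls_loss(logits, labels, alpha):
    """Mean LS loss over a batch; alpha = 0 reduces to plain cross-entropy."""
    total = 0.0
    for v, y in zip(logits, labels):
        targets = smoothed_targets(y, len(v), alpha)
        total -= sum(t * lp for t, lp in zip(targets, log_softmax(v)))
    return total / len(logits)


def ls_logit_gradient(v, y, alpha):
    """Per-sample gradient of the LS loss w.r.t. the logits: softmax(v) - targets."""
    probs = [math.exp(lp) for lp in log_softmax(v)]
    targets = smoothed_targets(y, len(v), alpha)
    return [p - t for p, t in zip(probs, targets)]


def logit_norm_uncertainty(v, p=2):
    """U = -max_k v'_k with v' = v / ||v||_p; assumes v is not all zeros."""
    norm = sum(abs(x) ** p for x in v) ** (1.0 / p)
    return -max(v) / norm


# Hypothetical toy batch: two samples, three classes.
logits = [[3.0, 4.0, 0.0], [1.0, 0.5, 0.2]]
labels = [1, 0]
print(ls_loss(logits, labels, alpha=0.1))
print(ls_logit_gradient(logits[0], labels[0], alpha=0.1))
print(logit_norm_uncertainty(logits[0]))  # -0.8, since ||v||_2 = 5 and max v = 4
```

A larger `U` means a less confident prediction, so a selective classifier would reject samples whose `U` exceeds a chosen threshold. Because the normalization is applied to the logits of an already trained model, it needs no retraining.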