Abstract: Label smoothing (LS) is a popular regularisation method for training neural networks as it is effective in improving test accuracy and is simple to implement. Hard one-hot labels are smoothed by uniformly distributing probability mass to other classes, reducing overfitting. Prior work has suggested that in some cases LS can degrade selective classification (SC) -- where the aim is to reject misclassifications using a model's uncertainty. In this work, we first demonstrate empirically across an extended range of large-scale tasks and architectures that LS consistently degrades SC. We then address a gap in existing knowledge, providing an explanation for this behaviour by analysing logit-level gradients: LS degrades the uncertainty rank ordering of correct vs incorrect predictions by regularising the max logit more when a prediction is likely to be correct, and less when it is likely to be wrong. This elucidates previously reported experimental results where strong classifiers underperform in SC. We then demonstrate the empirical effectiveness of post-hoc logit normalisation for recovering lost SC performance caused by LS. Furthermore, linking back to our gradient analysis, we again provide an explanation for why such normalisation is effective.
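
The logit-level argument in the abstract can be sketched with the standard softmax gradient. This is a minimal illustration, not the paper's full derivation; the symbols \(v_k\) (logits), \(\pi_k\) (softmax outputs), \(y\) (label index), \(K\) (number of classes) and \(\alpha\) (smoothing strength) are notation assumed here. For a single sample with smoothed target \((1-\alpha)\delta_{yk} + \alpha/K\):

\[ \frac{\partial L_{\text{LS}}}{\partial v_k} = \pi_k - (1-\alpha)\delta_{yk} - \frac{\alpha}{K}, \qquad \frac{\partial L_{\text{LS}}}{\partial v_k} - \frac{\partial L_{\text{CE}}}{\partial v_k} = \alpha\left(\delta_{yk} - \frac{1}{K}\right) \]

Relative to cross-entropy, LS adds a positive gradient term \(\alpha(1 - 1/K)\) to the label's logit, which gradient descent turns into extra downward pressure, and a small negative term \(-\alpha/K\) to every other logit. When the max logit belongs to the label (a correct prediction) it receives the extra suppression; when it belongs to another class (a misclassification) it does not.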

What problem does this paper attempt to address?

This paper explores the impact of label smoothing (LS) on selective classification (SC) and explains why LS reduces the effectiveness of SC. Specifically:

1. **Problem background**:
   - Label smoothing (LS) is a commonly used regularization technique for improving the test accuracy of neural networks. It reduces overfitting by mixing "hard" one-hot labels with a uniform distribution.
   - In selective classification (SC), the model must not only classify but also decide, based on its predictive uncertainty, whether to reject a prediction, so as to reduce the number of misclassifications.
2. **Deficiencies in existing research**:
   - Although LS improves classification accuracy, previous studies have suggested that it may harm SC performance. The specific reasons for this have not been fully explained.
3. **Research objectives**:
   - **Verify the impact of LS on SC**: Through extensive experiments, test whether LS consistently reduces SC performance across a variety of large-scale tasks and architectures.
   - **Explain why LS affects SC**: Analyze the gradient of the LS loss function to explain why LS reduces SC performance.
   - **Propose solutions**: Explore how post-hoc methods such as logit normalization can recover the SC performance lost to LS.
4. **Key contributions**:
   - **Empirical verification**: LS is shown to consistently reduce SC performance on multiple large-scale tasks and architectures, and the degradation becomes more pronounced as the LS strength increases.
   - **Theoretical explanation**: Analysis of the logit-level gradients reveals how LS disturbs the uncertainty ranking of correct and incorrect predictions, thereby harming SC performance.
   - **Solutions**: Post-hoc logit normalization is shown empirically to recover the SC performance lost to LS, and its effectiveness is explained through the gradient analysis.
5. **Formula summary**:
   - **Cross-entropy loss (CE)**:
     \[ L_{\text{CE}}(\theta)=-\frac{1}{N} \sum_{n = 1}^N \sum_{k = 1}^K \delta_{y^{(n)},\omega_k} \log P(\omega_k|x^{(n)};\theta) \]
   - **Label smoothing loss (LS)**:
     \[ L_{\text{LS}}(\theta;\alpha)=-\frac{1}{N} \sum_{n = 1}^N \sum_{k = 1}^K \left[(1 - \alpha)\delta_{y^{(n)},\omega_k}+\frac{\alpha}{K}\right] \log P(\omega_k|x^{(n)};\theta) \]
   - **Uncertainty estimate (U) after logit normalization**:
     \[ U=-v'_{\max}=-\max_k v'_k,\quad v'=\frac{v}{\|v\|_p} \]

Through these studies, the authors aim to provide a new perspective on the impact of LS on SC, along with a practical method for improving SC performance.
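
The three formulas above can be sketched in a few lines of standard-library Python. This is a minimal per-sample illustration under stated assumptions, not the authors' implementation: the function names (`softmax`, `smoothed_targets`, `ls_loss`, `ls_logit_gradient`, `logit_norm_uncertainty`) are hypothetical, labels are integer class indices, and the default `p=2` is an illustrative choice.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def smoothed_targets(label, num_classes, alpha):
    """Targets (1 - alpha) * one_hot + alpha / K, as in the LS loss."""
    return [
        (1.0 - alpha) * (1.0 if k == label else 0.0) + alpha / num_classes
        for k in range(num_classes)
    ]


def ls_loss(logits, label, alpha):
    """Per-sample label smoothing loss; alpha = 0 recovers cross-entropy."""
    probs = softmax(logits)
    targets = smoothed_targets(label, len(logits), alpha)
    return -sum(t * math.log(p) for t, p in zip(targets, probs))


def ls_logit_gradient(logits, label, alpha):
    """Gradient of the per-sample LS loss w.r.t. each logit: softmax - targets."""
    probs = softmax(logits)
    targets = smoothed_targets(label, len(logits), alpha)
    return [p - t for p, t in zip(probs, targets)]


def logit_norm_uncertainty(logits, p=2):
    """U = -max_k v'_k with v' = v / ||v||_p, applied post hoc to trained logits."""
    norm = sum(abs(v) ** p for v in logits) ** (1.0 / p)
    if norm == 0.0:
        raise ValueError("cannot normalise an all-zero logit vector")
    return -max(v / norm for v in logits)
```

Comparing `ls_logit_gradient(logits, label, alpha)` with the same call at `alpha=0.0` shows the extra term LS adds to each logit's gradient. `logit_norm_uncertainty` is invariant to rescaling the logit vector by a positive constant, so it ranks predictions by the relative size of the max logit, not by its absolute magnitude.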

Towards Understanding Why Label Smoothing Degrades Selective Classification and How to Fix It

Label Smoothing is Robustification against Model Misspecification

OT Cleaner: Label Correction As Optimal Transport

Cross Entropy versus Label Smoothing: A Neural Collapse Perspective

Learning with Noisy Labels Via Sparse Regularization

Learning label smoothing for text classification

Regularizing CNNs using Confusion Penalty Based Label Smoothing for Histopathology Images

Smooth Pseudo-Labeling

Label Smoothing for Text Mining.

Label Smoothing Improves Machine Unlearning

Rethinking Regularization with Random Label Smoothing

Generalizing Few Data to Unseen Domains Flexibly Based on Label Smoothing Integrated with Distributionally Robust Optimization

Improving Time Series Classification with Representation Soft Label Smoothing

Soft-label recover based label-specific features learning

The Devil is in the Margin: Margin-based Label Smoothing for Network Calibration

Be Careful What You Smooth For: Label Smoothing Can Be a Privacy Shield but Also a Catalyst for Model Inversion Attacks

Adaptive Label Smoothing for Out-of-Distribution Detection

ACLS: Adaptive and Conditional Label Smoothing for Network Calibration

Posterior Label Smoothing for Node Classification

Adaptive Regularization of Labels

Harnessing Side Information for Classification Under Label Noise