Abstract: Out-of-distribution (OOD) detection, which aims to distinguish unknown classes from known classes, has received increasing attention recently. A main challenge is the unavailability of samples from the unknown classes during training, and an effective strategy is to improve performance on the known classes. Using beneficial strategies such as data augmentation and longer training is thus a way to improve OOD detection. However, label smoothing, an effective method for classifying known classes, degrades the performance of OOD detection, and this phenomenon is underexplored. In this paper, we first show that the limited and predefined learning target in label smoothing results in a smaller maximal probability and logit, which in turn leads to worse OOD detection performance. To mitigate this issue, we then propose a novel regularization method, called adaptive label smoothing (ALS), whose core idea is to push the non-true classes to have the same probability while the maximal probability is neither fixed nor limited. Extensive experimental results on six datasets with two backbones suggest that ALS contributes to classifying known samples and discerning unknown samples with clear margins. Our code will be available to the public.
What problem does this paper attempt to address?
This paper addresses how to improve the detection of out-of-distribution (OOD) samples in deep learning. Specifically, the authors point out that although label smoothing (LS) improves classification performance on known classes, it degrades OOD detection. The reason is that label smoothing lowers the maximum predicted probability and the maximum logit for known samples, which increases the overlap between the scores of unknown and known samples and makes the two harder to separate.
To overcome this problem, the authors propose adaptive label smoothing (ALS). The core idea of ALS is to push the probabilities of the non-true classes to be equal without fixing or limiting the maximum probability. In this way, ALS retains the advantages of label smoothing while reducing its negative impact on OOD detection. According to the reported experiments, ALS improves classification performance on known classes across multiple datasets and strengthens OOD detection.
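To make the cap on the maximum probability concrete, the following sketch assumes the standard label-smoothing target \((1-\epsilon)\,y_i + \epsilon/N\); the summary above does not state which variant the paper uses, and the names `smoothed_target`, `eps`, and the example values are illustrative. With \(N = 10\) and \(\epsilon = 0.1\), the largest target entry is \(1 - \epsilon + \epsilon/N = 0.91\), and a prediction more confident than that is penalized.

```python
import math

def smoothed_target(true_index, num_classes, epsilon):
    """Assumed standard label-smoothing target: (1 - eps) * one_hot + eps / N."""
    uniform = epsilon / num_classes
    return [
        (1.0 - epsilon) + uniform if i == true_index else uniform
        for i in range(num_classes)
    ]

def cross_entropy(target, probs):
    """H(target, probs) = -sum_i target_i * log(probs_i)."""
    return -sum(t * math.log(p) for t, p in zip(target, probs) if t > 0.0)

N, eps = 10, 0.1  # illustrative values, not taken from the paper
target = smoothed_target(true_index=3, num_classes=N, epsilon=eps)

# Largest entry of the smoothed target: 1 - eps + eps / N.
cap = max(target)

# A prediction that is more confident than the cap (0.99 on the true class).
confident = [0.99 if i == 3 else 0.01 / (N - 1) for i in range(N)]

# Cross-entropy against the smoothed target is minimized when the prediction
# equals the target, so the over-confident prediction incurs a higher loss.
loss_at_target = cross_entropy(target, target)
loss_confident = cross_entropy(target, confident)
print(cap, loss_at_target, loss_confident)
```

Because the loss keeps pulling the true-class probability back toward the cap, scores based on the maximum probability or maximum logit for known samples are compressed, which is the overlap effect described above.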
### Main contributions
1. **Problem discovery**: The authors find that the limited and predefined learning target for the true class in label smoothing degrades OOD detection performance.
2. **Method proposal**: A new regularization method, adaptive label smoothing (ALS), is proposed, in which the maximum probability is neither limited nor fixed and the non-true classes are pushed to have the same probability.
3. **Experimental verification**: Experiments on six datasets with two backbones indicate that ALS improves OOD detection while also performing well on the classification of known classes.
### Method details
- **Formalization and notation**:
- For multi-class classification tasks, the input space and the label space are denoted \(X\) and \(Y\), respectively.
- The model is first optimized on a training dataset \(D_{\text{tr}}\) sampled from the joint distribution \(X_{\text{tr}}\times Y_{\text{tr}}\), and then evaluated on a test dataset \(D_{\text{te}}\) sampled from the joint distribution \(X_{\text{te}}\times Y_{\text{te}}\).
- The model is optimized with the cross-entropy loss:
\[
H(y, p)=-\sum_{i = 1}^{N}y_i\log p_i
\]
where \(p\) is the predicted probability distribution, \(p_i\) is the predicted probability of the \(i\)-th class, \(N\) is the total number of known classes, and \(y_i\) is the true label, which is 1 for the correct class and 0 otherwise.
- **Adaptive label smoothing**:
- ALS consists of two parts, and the formula is:
\[
L_{\text{ALS}}=L_{\text{MPC}}+\lambda\cdot L_{\text{NMPC}}
\]
where \(L_{\text{MPC}}\) and \(L_{\text{NMPC}}\) are the losses for the maximum probability category (MPC) and the non-maximum probability categories (NMPC), respectively, and \(\lambda\) is a hyperparameter that balances the two parts.
- \(L_{\text{MPC}}\) directly reuses the cross-entropy loss:
\[
L_{\text{MPC}}=H(y, p)
\]
- The goal of \(L_{\text{NMPC}}\) is to make the probabilities of the non-maximum probability categories equal. It is the standard deviation of those probabilities:
\[
L_{\text{NMPC}}=\sqrt{\frac{1}{N - 1}\sum_{i = 1, i\neq k}^{N}(p_i-\bar{p})^2}
\]
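where \(\bar{p}=\frac{1}{N - 1}\sum_{i = 1, i\neq k}^{N}p_i\) is the mean probability over the non-maximum probability categories and \(k\) is the index of the maximum probability category.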
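
The formulas above can be sketched for a single sample in plain Python. This is an illustration under stated assumptions, not the authors' implementation: `k` is taken as the argmax of the predicted probabilities, the default `lam=1.0` is a placeholder for \(\lambda\), and the function names `softmax` and `als_loss` are hypothetical. A training version would be batched and written in an autodiff framework.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def als_loss(logits, true_index, lam=1.0):
    """Return (L_ALS, L_MPC, L_NMPC) for one sample.

    lam is a placeholder for the balancing hyperparameter lambda.
    """
    p = softmax(logits)
    n = len(p)

    # L_MPC = H(y, p); with a one-hot y this reduces to -log p_true.
    l_mpc = -math.log(p[true_index])

    # Assumption: k is the index of the largest predicted probability.
    k = max(range(n), key=lambda i: p[i])
    rest = [p[i] for i in range(n) if i != k]

    # L_NMPC: standard deviation of the remaining N - 1 probabilities.
    mean_rest = sum(rest) / (n - 1)
    l_nmpc = math.sqrt(sum((q - mean_rest) ** 2 for q in rest) / (n - 1))

    return l_mpc + lam * l_nmpc, l_mpc, l_nmpc

# Same top logit, different spread over the remaining classes.
uneven = als_loss([4.0, 2.0, 0.0, -1.0], true_index=0, lam=1.0)
even = als_loss([4.0, 0.0, 0.0, 0.0], true_index=0, lam=1.0)
print(uneven, even)
```

In this sketch the NMPC term vanishes when the non-maximum probabilities are equal, while nothing in the loss caps the maximum probability itself, which matches the stated design goal of ALS.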