Abstract: Out-of-distribution (OOD) detection is crucial in many real-world applications. However, intelligent models are often trained solely on in-distribution (ID) data, leading to overconfidence when misclassifying OOD data as ID classes. In this study, we propose a new learning framework which leverages simple Jigsaw-based fake OOD data and rich semantic embeddings ('anchors') from the ChatGPT description of ID knowledge to help guide the training of the image encoder. The learning framework can be flexibly combined with existing post-hoc approaches to OOD detection, and extensive empirical evaluations on multiple OOD detection benchmarks demonstrate that rich textual representation of ID knowledge and fake OOD knowledge can effectively help train a visual encoder for OOD detection. With the learning framework, new state-of-the-art performance was achieved on all the benchmarks. The code is available at https://github.com/Cverchen/TagFog.
What problem does this paper attempt to address?
This paper addresses the following problem: during deployment in real-world applications, intelligent models often encounter samples drawn from a distribution different from that of the training data (i.e., out-of-distribution, or OOD, samples). These OOD samples usually come from unknown classes that did not appear during model training. Incorrectly classifying OOD samples as known in-distribution (ID) classes may lead to serious consequences, for example in autonomous driving and intelligent healthcare. Accurately detecting whether new data are OOD samples or belong to known classes is therefore crucial for AI models.
Specifically, this research proposes a new learning framework, TagFog (Textual Anchor Guidance and Fake Outlier Generation for Visual Out-of-Distribution Detection), which aims to improve OOD detection in the following ways:
1. **Generate fake OOD data**: Use a simple Jigsaw transformation to generate fake OOD data that help the model better distinguish between ID and real OOD data.
2. **Utilize text anchor guidance**: Generate a description for each ID category with ChatGPT and feed it into the pre-trained CLIP text encoder to obtain richer semantic embeddings, which serve as anchors that guide the training of the image encoder.
By combining these two methods, the TagFog framework achieves state-of-the-art performance on multiple OOD detection benchmarks, effectively improving the model's ability to detect OOD samples.
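The fake-outlier step can be illustrated with a small, dependency-free sketch. It assumes the image is a 2D grid (a list of rows) whose sides are divisible by the grid size. The function name `jigsaw_shuffle`, the `grid` parameter, and the choice to re-draw until the patch order differs from the original are illustrative assumptions, not details taken from the paper.

```python
import random


def jigsaw_shuffle(image, grid=2, rng=None):
    """Split a 2D image (list of rows) into grid x grid patches and shuffle them."""
    rng = rng or random.Random()
    height, width = len(image), len(image[0])
    if height % grid or width % grid:
        raise ValueError("image sides must be divisible by grid")
    patch_h, patch_w = height // grid, width // grid

    # Cut the image into patches in row-major order.
    patches = []
    for block_row in range(grid):
        rows = image[block_row * patch_h:(block_row + 1) * patch_h]
        for block_col in range(grid):
            cols = slice(block_col * patch_w, (block_col + 1) * patch_w)
            patches.append([row[cols] for row in rows])

    # Assumption: reject the identity permutation so the layout really changes.
    order = list(range(len(patches)))
    while len(order) > 1 and order == sorted(order):
        rng.shuffle(order)

    # Reassemble the patches in the shuffled order.
    shuffled = []
    for block_row in range(grid):
        for r in range(patch_h):
            new_row = []
            for block_col in range(grid):
                new_row.extend(patches[order[block_row * grid + block_col]][r])
            shuffled.append(new_row)
    return shuffled
```

A shuffled image keeps the local texture of the ID image but loses its global layout, which is plausibly why it can stand in for an outlier during training.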
### Formula Summary
- **Cross-entropy loss function \( L_{CE} \)**:
\[
L_{CE}=-\frac{1}{N + M}\sum_{i = 1}^{N + M}\sum_{k = 1}^{K + 1}y_{i,k}\log(\hat{y}_{i,k})
\]
where \( N \) and \( M \) are the numbers of ID training images and fake OOD images, respectively, \( K \) is the number of ID categories (with the \( (K+1) \)-th category reserved for the fake OOD images), \(\hat{y}_{i,k}\) is the output probability that the \( i \)-th training image belongs to the \( k \)-th category, and \( y_{i,k} \) is the corresponding ground-truth output (0 or 1).
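As a rough illustration of this loss, the sketch below computes the mean cross-entropy over a mixed batch using only the standard library. Because \( y_{i,k} \) is one-hot, the inner sum reduces to the negative log-probability of the true category. The function names and the convention that integer label `K` (the last index) marks a fake OOD image are assumptions made for this example.

```python
import math


def softmax(logits):
    """Numerically stable softmax over one list of logits."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy_k_plus_1(logits_batch, labels):
    """Mean cross-entropy over N ID + M fake OOD samples with K+1 categories.

    labels[i] is an integer in [0, K]; index K is assumed to be the
    fake OOD category.
    """
    total = 0.0
    for logits, label in zip(logits_batch, labels):
        probs = softmax(logits)
        # One-hot targets: only the true category contributes to the sum over k.
        total -= math.log(probs[label])
    return total / len(labels)
```

With uniform logits over three categories (K = 2 plus the fake OOD category), every sample contributes \( \log 3 \), which is a quick sanity check on the implementation.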
- **Contrastive loss \( L_{CI} \)**:
\[
L_{CI}=-\frac{1}{N}\sum_{n = 1}^N\sum_{k = 1}^K1(y_{n,k}\neq0)\cdot\log\left(\frac{\exp(s(z_n,\mu_k)/\tau)}{\sum_{j = 1}^K\exp(s(z_n,\mu_j)/\tau)}\right)
\]
where \( z_n = g(f(x_n)) \) is the projected visual embedding of the input ID image \( x_n \), \( \mu_k \) is the textual anchor of the \( k \)-th ID category, \( s(z_n,\mu_k) \) is the cosine similarity between the two embeddings, \( 1(\cdot) \) is the indicator function, and \(\tau\) is the temperature scaling factor.
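The following minimal sketch shows one way to evaluate this loss for plain Python lists. It assumes `embeddings` holds the projected visual embeddings of the ID images, `anchors` holds one text-derived anchor per ID category, and `labels` gives each image's category index. The function names and the default `tau=0.1` are illustrative choices, not values reported in the source.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors given as lists."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def anchor_contrastive_loss(embeddings, labels, anchors, tau=0.1):
    """Pull each ID embedding toward its own category anchor, away from the rest."""
    total = 0.0
    for z, label in zip(embeddings, labels):
        sims = [cosine(z, mu) / tau for mu in anchors]
        # Log-sum-exp over all K anchors gives the log of the denominator.
        peak = max(sims)
        log_denominator = peak + math.log(sum(math.exp(s - peak) for s in sims))
        # The indicator keeps only the term for the true category.
        total -= sims[label] - log_denominator
    return total / len(embeddings)
```

The loss is small when every embedding is closest to the anchor of its own category and grows when embeddings align with the wrong anchor.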
- **Supervised contrastive loss \( L_{SC} \)**:
\[
L_{SC}=-\frac{1}{S}\sum_{i = 1}^S\frac{1}{|P(i)|}\sum_{p\in P(i)}\log\left(\frac{\exp(s(z_i,z_p)/\tau')}{\sum_{a\in A(i)}\exp(s(z_i,z_a)/\tau')}\right)
\]
where \( S = N + M \), \( A(i) \) represents all sample indices in the mini-batch containing the sample with index \( i \), \( P(i) \) is the subset of \( A(i) \) whose samples share the same category label as the sample with index \( i \), and \(\tau'\) is a temperature scaling factor.
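A small sketch of this loss on one mini-batch is given below. It follows the common supervised contrastive convention in which \( A(i) \) excludes the sample \( i \) itself, which is an assumption here because the definition above leaves that point open. Samples with no positives contribute zero, another assumption, and the default `tau=0.1` and the function names are illustrative.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors given as lists."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def supervised_contrastive_loss(embeddings, labels, tau=0.1):
    """Supervised contrastive loss over one batch of ID and fake OOD embeddings."""
    size = len(embeddings)
    total = 0.0
    for i in range(size):
        # Assumption: A(i) holds every other index in the batch.
        others = [a for a in range(size) if a != i]
        positives = [p for p in others if labels[p] == labels[i]]
        if not positives:
            continue  # Assumption: samples without positives add nothing.
        sims = {a: cosine(embeddings[i], embeddings[a]) / tau for a in others}
        peak = max(sims.values())
        log_denominator = peak + math.log(
            sum(math.exp(s - peak) for s in sims.values())
        )
        # Average the log-probabilities over the positives of sample i.
        total -= sum(sims[p] - log_denominator for p in positives) / len(positives)
    return total / size
```

A batch whose same-label embeddings are clustered together yields a lower loss than the same embeddings with mixed-up labels, which is the behaviour the loss is meant to reward.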