Abstract: A straightforward pipeline for zero-shot out-of-distribution (OOD) detection involves selecting potential OOD labels from an extensive semantic pool and then leveraging a pre-trained vision-language model to perform classification on both in-distribution (ID) and OOD labels. In this paper, we theorize that enhancing performance requires expanding the semantic pool, while increasing the expected probability of selected OOD labels being activated by OOD samples, and ensuring low mutual dependence among the activations of these OOD labels. A natural expansion manner is to adopt a larger lexicon; however, the inevitable introduction of numerous synonyms and uncommon words fails to meet the above requirements, indicating that viable expansion manners move beyond merely selecting words from a lexicon. Since OOD detection aims to correctly classify input images into ID/OOD class groups, we can "make up" OOD label candidates which are not standard class names but beneficial for the process. Observing that the original semantic pool is comprised of unmodified specific class names, we correspondingly construct a conjugated semantic pool (CSP) consisting of modified superclass names, each serving as a cluster center for samples sharing similar properties across different categories. Consistent with our established theory, expanding OOD label candidates with the CSP satisfies the requirements and outperforms existing works by 7.89% in FPR95. Code is available at https://github.com/MengyuanChen21/NeurIPS2024-CSP.
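The pipeline described in the abstract (classify an image against both ID and OOD labels, then decide ID versus OOD) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function name `id_score`, the temperature value, and the three-dimensional toy embeddings are all assumptions, and in practice the embeddings would come from a pre-trained vision-language model's image and text encoders.

```python
import numpy as np


def l2_normalize(x):
    """Scale vectors to unit length so dot products become cosine similarities."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)


def id_score(image_emb, id_text_emb, ood_text_emb, temperature=0.01):
    """Softmax mass assigned to the ID labels (hypothetical scoring rule).

    The image is compared against ID and OOD label embeddings together;
    a high score means the ID labels dominate, a low score means some
    OOD label was activated instead.
    """
    img = l2_normalize(np.asarray(image_emb, dtype=float))
    labels = l2_normalize(np.vstack([id_text_emb, ood_text_emb]).astype(float))
    logits = labels @ img / temperature
    logits -= logits.max()  # subtract the max for numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()
    return float(probs[: len(id_text_emb)].sum())


# Toy stand-ins for text embeddings: two ID labels and one OOD label.
ID_TEXT = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
OOD_TEXT = np.array([[0.0, 0.0, 1.0]])

score_for_id_like_image = id_score([0.9, 0.1, 0.0], ID_TEXT, OOD_TEXT)
score_for_ood_like_image = id_score([0.0, 0.1, 0.9], ID_TEXT, OOD_TEXT)
```

Thresholding the score then yields the ID/OOD decision. Expanding the OOD label candidates, for example with CSP labels, corresponds to adding rows to `ood_text_emb`.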

What problem does this paper attempt to address?

This paper asks how to improve the out-of-distribution (OOD) detection performance of pre-trained vision-language models (VLMs) in zero-shot scenarios. Specifically, the authors propose a new method that improves the existing OOD detection process based on the semantic pool.

### Problem Background

Traditional OOD detection methods rely mainly on the image modality alone and ignore the rich information in text labels. With the development of pre-trained vision-language models, using text information for visual OOD detection has become an emerging paradigm. However, the existing methods have some limitations:

1. **Simple Lexical Expansion**: Directly expanding the lexicon increases the size of the semantic pool, but it introduces a large number of uncommon words and synonyms. This lowers the activation probability of OOD labels and creates a high degree of dependence among these labels, which violates the theoretical independence assumption.
2. **Performance Bottleneck**: As the proportion of selected OOD labels increases, the actual performance shows an inverted-V-shaped trend rather than a monotonic increase. This indicates that simply increasing the size of the semantic pool cannot effectively improve OOD detection performance.

### Paper Solution

To solve the above problems, the authors propose constructing a "Conjugated Semantic Pool" (CSP) that improves OOD detection in the following ways:

1. **Construct Superclass Names**: The CSP consists of modified superclass names, and each superclass name serves as the center of a cluster of samples with similar properties. Examples include "white creature", "valuable item", and "communal place".
2. **Increase Activation Probability**: Since the superclass names cover a wider range of categories, their cluster centers have a higher probability of being activated by OOD samples, thereby increasing the expected activation probability \( q_2 \) of OOD labels.
3. **Reduce Dependence**: The distribution of CSP cluster centers in the feature space differs significantly from that of the original category cluster centers. This reduces the mutual dependence between new and old labels and supports the independence assumption on the Bernoulli variables.

### Experimental Results

The experimental results show that, after using the CSP to expand the OOD label candidate set, FPR95 improves by 7.89% compared with existing methods, which supports the effectiveness of the method.

### Theoretical Analysis

The authors also derive a theoretical optimization framework, which points out that the key to improving OOD detection performance lies in:

- Simultaneously increasing the size \( M \) of the semantic pool and the expected activation probability \( q_2 \) of OOD labels.
- Ensuring that the activation states of the selected OOD labels maintain a low degree of dependence.

With these improvements, the proposed method achieves significant performance gains on multiple OOD detection benchmarks.

### Summary

The main contributions of this paper include:

1. Proposing a new theoretical framework to guide the improvement of OOD detection performance.
2. Analyzing the ineffectiveness of simple lexical expansion and proposing an alternative solution.
3. Constructing the Conjugated Semantic Pool (CSP) and demonstrating its effectiveness through experiments.
4. Demonstrating superior performance on multiple OOD detection benchmarks.

Through these innovations, the paper provides new ideas and methods for improving the performance of pre-trained vision-language models in OOD detection tasks.
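The three requirements from the theoretical analysis (larger \( M \), higher \( q_2 \), low dependence) can be illustrated with a simplified Bernoulli model. The sketch below is not the paper's derivation. It assumes each of the \( M \) OOD labels is activated with probability `q1` by an ID sample and `q2` by an OOD sample, where `q1` and the pairwise correlation `rho` are hypothetical parameters introduced here for illustration. It then measures how far apart the two activation counts are, in units of their combined standard deviation.

```python
import math


def separation(M, q1, q2, rho=0.0):
    """Gap between mean activation counts on OOD vs. ID inputs, in std units.

    Assumes M Bernoulli activations with a common probability and a common
    pairwise correlation rho (an equicorrelated toy model), so that
    Var(count) = M * q * (1 - q) * (1 + (M - 1) * rho).
    """
    def count_variance(q):
        return M * q * (1 - q) * (1 + (M - 1) * rho)

    return M * (q2 - q1) / math.sqrt(count_variance(q1) + count_variance(q2))


# Baseline pool: 100 independent labels.
baseline = separation(M=100, q1=0.1, q2=0.3)

# Ideal expansion: more labels, same q2, still independent.
ideal_expansion = separation(M=400, q1=0.1, q2=0.3)

# Lexicon-style expansion: more labels, but uncommon words lower q2.
diluted_expansion = separation(M=400, q1=0.1, q2=0.15)

# Synonym-heavy pool: same size and q2, but correlated activations.
dependent_pool = separation(M=100, q1=0.1, q2=0.3, rho=0.05)
```

Under these assumptions, separation grows with \( \sqrt{M} \) when \( q_2 \) is held fixed, but it shrinks when expansion lowers \( q_2 \) or introduces correlated labels. This mirrors the summary's argument for why a larger lexicon alone does not help.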

Conjugated Semantic Pool Improves OOD Detection with Pre-trained Vision-Language Models
