Abstract: Multimodal fusion, which leverages data such as vision and language, is rapidly gaining traction. This enriched data representation improves performance across various tasks. Existing methods for out-of-distribution (OOD) detection, a critical area where AI models encounter unseen data in real-world scenarios, rely heavily on whole-image features. These image-level features can include irrelevant information that hinders the detection of OOD samples, ultimately limiting overall performance. In this paper, we propose **TagOOD**, a novel approach for OOD detection that leverages vision-language representations to achieve label-free object feature decoupling from whole images. This decomposition enables a more focused analysis of object semantics, enhancing OOD detection performance. Subsequently, TagOOD trains a lightweight network on the extracted object features to learn representative class centers. These centers capture the central tendencies of in-distribution (IND) object classes, minimizing the influence of irrelevant image features during OOD detection. Finally, our approach efficiently detects OOD samples by calculating distance-based metrics as OOD scores between learned centers and test samples. We conduct extensive experiments to evaluate TagOOD on several benchmark datasets and demonstrate its superior performance compared to existing OOD detection methods. This work presents a novel perspective for further exploration of multimodal information utilization in OOD detection, with potential applications across various tasks.

### What problem does this paper attempt to solve?

This paper addresses a key problem that deep-learning models face in real-world deployment: **how to effectively detect out-of-distribution (OOD) samples**. A deep-learning model is exposed only to data from a specific distribution (in-distribution, IND) during training, but after deployment it inevitably encounters samples drawn from a different distribution. The model needs to identify these OOD samples and raise an alarm.

Existing OOD detection methods usually rely on image-level features to construct a scoring function that identifies OOD samples. This approach has a major limitation: a single label cannot fully describe all the content in an image, especially when the image contains multiple objects. The model may therefore learn some "OOD-like" features from IND data, leading to misclassification. In addition, irrelevant features such as background information interfere with the accuracy of OOD detection.

To solve these problems, this paper proposes a new method, **TagOOD**, which uses vision-language representations to decouple object features from an image without labels, and uses class center learning to generate more representative class centers. This lets TagOOD analyze images at a finer-grained semantic level, thereby improving OOD detection performance.

### Specific contributions

1. **Decoupling image features using a vision-language model**: TagOOD uses a pre-trained vision-language model to generate multiple semantic labels, focuses on the semantic content of objects, and reduces the interference of background features.
2. **Generating object-level class centers**: By capturing the central tendency of IND objects, TagOOD can more effectively distinguish between IND and OOD samples, even if they contain similar objects.
3. **Experimental verification**: The authors conducted extensive experiments and ablation studies on multiple benchmark datasets, demonstrating the effectiveness and superiority of TagOOD.

### Summary

TagOOD proposes a new OOD detection method that combines visual and language information, addressing the problem that existing methods are easily disturbed by irrelevant features when dealing with complex images. This provides new ideas for the field of OOD detection and shows the potential of multimodal information for improving model robustness.
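
The distance-based scoring step described above can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes object features have already been extracted as fixed-length vectors, it substitutes a plain per-class mean for the lightweight network that TagOOD trains to learn class centers, and it assumes cosine distance to the nearest center as the OOD score. The function names (`l2_normalize`, `fit_class_centers`, `ood_score`) and the synthetic data are hypothetical and chosen only for illustration.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products equal cosine similarities."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)


def fit_class_centers(features, labels, num_classes):
    """Return one unit-length center per class.

    Assumption: the mean of normalized features stands in for the
    learned class centers described in the abstract.
    """
    feats = l2_normalize(np.asarray(features, dtype=np.float64))
    labels = np.asarray(labels)
    centers = np.stack(
        [feats[labels == k].mean(axis=0) for k in range(num_classes)]
    )
    return l2_normalize(centers)


def ood_score(features, centers):
    """Cosine distance to the nearest class center; larger means more OOD-like."""
    feats = l2_normalize(np.asarray(features, dtype=np.float64))
    sims = feats @ centers.T  # shape: (num_samples, num_classes)
    return 1.0 - sims.max(axis=1)


# Synthetic stand-in for IND object features: two classes clustered
# around two orthogonal directions in a 4-dimensional feature space.
rng = np.random.default_rng(0)
base = np.eye(4)[:2]
train_x = np.concatenate(
    [base[k] + 0.05 * rng.standard_normal((50, 4)) for k in range(2)]
)
train_y = np.repeat([0, 1], 50)

centers = fit_class_centers(train_x, train_y, num_classes=2)

ind_sample = np.array([[1.0, 0.05, 0.0, 0.0]])  # close to class 0
ood_sample = np.array([[0.0, 0.0, 1.0, 0.0]])   # direction unseen in training
scores = ood_score(np.concatenate([ind_sample, ood_sample]), centers)
```

In this toy setup the sample lying along an unseen direction receives a much higher score than the one near a class center, so a threshold on `scores` would separate them. How the threshold is chosen, and which distance metric works best, are left open here because the text above does not specify them.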

TagOOD: A Novel Approach to Out-of-Distribution Detection via Vision-Language Representations and Class Center Learning

TagFog: Textual Anchor Guidance and Fake Outlier Generation for Visual Out-of-Distribution Detection

YolOOD: Utilizing Object Detection Concepts for Multi-Label Out-of-Distribution Detection

Delving into Out-of-Distribution Detection with Vision-Language Representations

MultiOOD: Scaling Out-of-Distribution Detection for Multiple Modalities

From Global to Local: Multi-scale Out-of-distribution Detection

Calibrated Out-of-Distribution Detection with a Generic Representation

General-Purpose Multi-Modal OOD Detection Framework

A Unified Approach to Semi-Supervised Out-of-Distribution Detection

Envisioning Outlier Exposure by Large Language Models for Out-of-Distribution Detection

Language-Enhanced Latent Representations for Out-of-Distribution Detection in Autonomous Driving

Exploring Large Language Models for Multi-Modal Out-of-Distribution Detection

Negative Label Guided OOD Detection with Pretrained Vision-Language Models

From Open Vocabulary to Open World: Teaching Vision Language Models to Detect Novel Objects

Out-of-Distribution Detection for LiDAR-based 3D Object Detection

COOD: Concept-based Zero-shot OOD Detection

Generalized Out-of-Distribution Detection and Beyond in Vision Language Model Era: A Survey

OAL: Enhancing OOD Detection Using Latent Diffusion

MOODv2: Masked Image Modeling for Out-of-Distribution Detection

Out-of-Distribution Detection Using Peer-Class Generated by Large Language Model

Unveiling the unseen: novel strategies for object detection beyond known distributions