TagOOD: A Novel Approach to Out-of-Distribution Detection via Vision-Language Representations and Class Center Learning

Jinglun Li, Xinyu Zhou, Kaixun Jiang, Lingyi Hong, Pinxue Guo, Zhaoyu Chen, Weifeng Ge, Wenqiang Zhang
2024-08-28
Abstract: Multimodal fusion, leveraging data like vision and language, is rapidly gaining traction. This enriched data representation improves performance across various tasks. Existing methods for out-of-distribution (OOD) detection, a critical area where AI models encounter unseen data in real-world scenarios, rely heavily on whole-image features. These image-level features can include irrelevant information that hinders the detection of OOD samples, ultimately limiting overall performance. In this paper, we propose TagOOD, a novel approach for OOD detection that leverages vision-language representations to achieve label-free object feature decoupling from whole images. This decomposition enables a more focused analysis of object semantics, enhancing OOD detection performance. Subsequently, TagOOD trains a lightweight network on the extracted object features to learn representative class centers. These centers capture the central tendencies of IND object classes, minimizing the influence of irrelevant image features during OOD detection. Finally, our approach efficiently detects OOD samples by calculating distance-based metrics as OOD scores between learned centers and test samples. We conduct extensive experiments to evaluate TagOOD on several benchmark datasets and demonstrate its superior performance compared to existing OOD detection methods. This work presents a novel perspective for further exploration of multimodal information utilization in OOD detection, with potential applications across various tasks.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper addresses a key problem that deep-learning models face in real-world deployment: **how to effectively detect out-of-distribution (OOD) samples**. A model is exposed only to data from a specific distribution (in-distribution, IND) during training, yet it inevitably encounters samples from a different distribution once deployed. It must be able to identify these OOD samples and raise an alarm.

Existing OOD detection methods usually build a scoring function on image-level features. This approach has a major limitation: a single label cannot fully describe everything in an image, especially when the image contains multiple objects. The model may therefore learn "OOD-like" features from IND data, leading to misclassification. Irrelevant features such as background information further reduce detection accuracy.

To address these problems, the paper proposes **TagOOD**, which uses vision-language representations to decouple object features from whole images without labels, and class center learning to produce more representative class centers. This lets TagOOD operate at a finer-grained semantic level, improving OOD detection performance.

### Specific contributions

1. **Decoupling image features using a vision-language model**: TagOOD uses a pre-trained vision-language model to generate multiple semantic labels, focusing on the semantic content of objects and reducing interference from background features.
2. **Generating object-level class centers**: By capturing the central tendency of IND objects, TagOOD can more effectively distinguish between IND and OOD samples, even when they contain similar objects.
3. **Experimental verification**: The authors conducted extensive experiments and ablation studies on multiple benchmark datasets, demonstrating the effectiveness of TagOOD and its advantage over existing methods.

### Summary

TagOOD is a new OOD detection method that combines visual and language information, addressing the tendency of existing methods to be disrupted by irrelevant features in complex images. It offers a new direction for OOD detection and shows the potential of multimodal information for improving model robustness.
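
The distance-based scoring step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes object features are already available as plain vectors, stands in per-class means for the centers that TagOOD learns with a lightweight network, and uses Euclidean distance as one possible distance metric. The function names, toy data, and threshold are all hypothetical.

```python
import math
from collections import defaultdict


def class_centers(features, labels):
    """Return the mean feature vector per class.

    Assumption: a simple mean replaces the learned class centers.
    """
    sums = {}
    counts = defaultdict(int)
    for vec, label in zip(features, labels):
        if label not in sums:
            sums[label] = [0.0] * len(vec)
        sums[label] = [s + v for s, v in zip(sums[label], vec)]
        counts[label] += 1
    return {
        label: [s / counts[label] for s in total]
        for label, total in sums.items()
    }


def ood_score(feature, centers):
    """Distance to the nearest class center; larger means more OOD-like."""
    return min(math.dist(feature, center) for center in centers.values())


def is_ood(feature, centers, threshold):
    """Flag a sample as OOD when its score exceeds a chosen threshold."""
    return ood_score(feature, centers) > threshold


# Toy IND object features for two classes (hypothetical values).
ind_features = [[0.0, 0.0], [0.2, 0.0], [4.0, 4.0], [4.0, 4.2]]
ind_labels = ["cat", "cat", "dog", "dog"]
centers = class_centers(ind_features, ind_labels)

near = ood_score([0.1, 0.1], centers)    # close to the "cat" center
far = ood_score([10.0, -10.0], centers)  # far from every center
```

In practice the threshold would be chosen on held-out IND data, for example so that a fixed fraction of IND samples is accepted.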