Abstract: Identifying labels that did not appear during training, known as multi-label zero-shot learning, is a non-trivial task in computer vision. To this end, recent studies have attempted to exploit the multi-modal knowledge of vision-language pre-training (VLP) models through knowledge distillation, so that unseen labels can be recognized in an open-vocabulary manner. However, experimental evidence shows that knowledge distillation is suboptimal and provides limited performance gain in unseen label prediction. In this paper, a novel query-based knowledge sharing paradigm is proposed to explore the multi-modal knowledge from the pretrained VLP model for open-vocabulary multi-label classification. Specifically, a set of learnable label-agnostic query tokens is trained to extract critical vision knowledge from the input image. These tokens are then shared across all labels, allowing each label to select tokens of interest as visual clues for recognition. Besides, we propose an effective prompt pool for robust label embedding, and reformulate standard ranking learning as a form of classification so that the magnitude of feature vectors can be used for matching; both significantly benefit label recognition. Experimental results show that our framework significantly outperforms state-of-the-art methods on the zero-shot task by 5.9% and 4.5% in mAP on NUS-WIDE and Open Images, respectively.

What problem does this paper attempt to address?

The paper addresses zero-shot recognition in multi-label image classification, that is, identifying labels that were not seen during the training phase. Specifically, it focuses on how to leverage the knowledge of vision-language pre-training (VLP) models to improve the recognition of unseen labels in an open-vocabulary setting. Traditional zero-shot learning methods often rely on knowledge transfer from the text modality alone, neglecting the visual modality and its cross-modal semantic knowledge, which leads to poor performance. The paper therefore proposes a Query-based Knowledge Sharing (QKS) framework that improves the recognition of unseen labels by extracting and sharing key visual knowledge from pre-trained VLP models.

### Main Contributions:

1. **Knowledge Extraction Module Design**: A novel knowledge extraction module is proposed to explore multi-modal knowledge from VLP models and extract key visual cues that match label embeddings.
2. **Prompting Technique for Label Embeddings**: A simple yet effective prompting technique is introduced to provide rich and diverse contexts for each label, generating robust label embeddings for matching with visual features.
3. **Classification Form of Ranking Learning**: Ranking learning is reformulated into a classification form, allowing the use of feature vector magnitudes for label prediction, significantly improving the model's performance in terms of precision, recall, and F1 score.
4. **Query-based Knowledge Sharing Paradigm**: A query-based knowledge sharing paradigm is proposed to explore multi-modal knowledge from pre-trained VLP models for open-vocabulary multi-label recognition, significantly outperforming existing methods with 5.9% and 4.5% mAP improvements on the NUS-WIDE and Open Images datasets, respectively.
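
Contributions 2 and 3 can be illustrated with a minimal sketch. It is not the authors' implementation: the templates in `PROMPT_POOL`, the hash-based `encode_text` stand-in for the frozen VLP text encoder, the averaging of template embeddings, and the use of binary cross-entropy as the classification loss are all assumptions made for illustration, since this summary does not specify them.

```python
import hashlib
import math

# Hypothetical prompt templates; the paper's actual pool is not listed here.
PROMPT_POOL = [
    "a photo of a {}.",
    "a close-up photo of a {}.",
    "there is a {} in the scene.",
]

def encode_text(text, dim=8):
    """Stand-in for the frozen VLP text encoder (assumption): a deterministic
    pseudo-embedding derived from a hash, so the sketch runs without a model."""
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    return [digest[i] / 255.0 - 0.5 for i in range(dim)]

def label_embedding(label, dim=8):
    """Average the embeddings of every filled-in template (assumed aggregation)."""
    vectors = [encode_text(t.format(label), dim) for t in PROMPT_POOL]
    return [sum(column) / len(vectors) for column in zip(*vectors)]

def classification_loss(visual_feature, label_embeddings, targets):
    """Binary cross-entropy on raw dot-product scores (assumed loss form).
    No L2 normalisation is applied, so vector magnitude affects each score."""
    loss = 0.0
    for emb, y in zip(label_embeddings, targets):
        score = sum(v * e for v, e in zip(visual_feature, emb))
        prob = 1.0 / (1.0 + math.exp(-score))
        loss -= y * math.log(prob) + (1 - y) * math.log(1.0 - prob)
    return loss / len(targets)

labels = ["dog", "beach", "car"]
embeddings = [label_embedding(name) for name in labels]
visual = [0.3, -0.1, 0.2, 0.0, 0.5, -0.4, 0.1, 0.2]  # toy image feature
loss = classification_loss(visual, embeddings, targets=[1, 0, 0])
```

Because the scores are unnormalised dot products rather than cosine similarities, rescaling `visual` changes the loss. This is the sense in which a classification-style objective lets feature magnitude take part in matching.
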
### Method Overview:

- **Problem Setting**: Defines the image space, the seen label set, and the unseen label set, and describes the objectives of the standard multi-label zero-shot learning (ZSL) task and the generalized zero-shot learning (GZSL) task.
- **Framework Overview**: The QKS framework includes a frozen VLP model, a knowledge extraction module, and a knowledge sharing module. The VLP model encodes spatial features of input images and semantic embeddings of candidate labels. The knowledge extraction module aggregates key visual knowledge through a set of trainable label-agnostic query tokens, and the knowledge sharing module allows label embeddings to select the query tokens of interest as visual cues.
- **Feature Extraction**: Details how the VLP visual encoder generates spatial features of images, which are then mapped to a high-dimensional space through a linear projection layer. Additionally, a prompt pool technique is proposed to generate rich embeddings for each label using multiple preset prompt templates.
- **Knowledge Extraction**: A set of label-agnostic query tokens aggregates key visual knowledge through a Transformer decoder, so that the final query tokens contain the information crucial for label recognition.
- **Knowledge Sharing**: The extracted visual knowledge is shared with all labels, allowing each to select the most relevant parts as visual cues, thereby enabling the recognition of unseen labels.
- **Classification Form of Ranking Learning**: The matching scores are fed directly into a classification loss instead of the traditional ranking loss, so that the magnitude of feature vectors is taken into account, improving recognition performance.

### Experimental Results:

- **NUS-WIDE Dataset**: QKS achieves the best performance across all metrics, improving mAP over the MKT method by 5.9% in the ZSL task (the figure reported in the abstract) and 4.2% in the GZSL task.
- **Open Images Dataset**: QKS also performs strongly across all metrics, improving mAP over the MKT method by 4.5% in the ZSL task.

### Conclusion:

The proposed QKS framework significantly improves the performance of multi-label zero-shot learning in an open-vocabulary setting through its knowledge extraction and sharing mechanisms, demonstrating its potential in practical applications.
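
The knowledge extraction and knowledge sharing steps described in the Method Overview can be sketched as two rounds of attention. This is a simplified illustration under stated assumptions, not the paper's architecture. It uses single-head attention with no learned projections, feed-forward layers, or stacked decoder blocks. Random vectors stand in for the frozen VLP features, the trainable query tokens, and the label embeddings. The softmax-weighted selection in the sharing step is an assumed mechanism. All names (`attend`, `patch_features`, `query_tokens`) are hypothetical.

```python
import math
import random

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def softmax(scores):
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(queries, keys_values):
    """Single-head scaled dot-product attention in which keys and values are
    the same vectors: each query returns a weighted average of keys_values."""
    scale = math.sqrt(len(keys_values[0]))
    outputs = []
    for q in queries:
        weights = softmax([dot(q, k) / scale for k in keys_values])
        outputs.append([sum(w * v[d] for w, v in zip(weights, keys_values))
                        for d in range(len(q))])
    return outputs

def random_vectors(rng, count, dim):
    return [[rng.uniform(-1.0, 1.0) for _ in range(dim)] for _ in range(count)]

rng = random.Random(0)
dim = 8
patch_features = random_vectors(rng, 49, dim)   # stand-in for VLP spatial features (7x7 grid)
query_tokens = random_vectors(rng, 4, dim)      # label-agnostic queries; trainable in a real model
label_embeddings = random_vectors(rng, 3, dim)  # stand-in for text embeddings of seen or unseen labels

# Knowledge extraction: the queries aggregate visual content from the patches.
extracted = attend(query_tokens, patch_features)

# Knowledge sharing: every label attends over the same extracted tokens,
# picks its own visual clue, and is scored against that clue.
clues = attend(label_embeddings, extracted)
scores = [dot(label, clue) for label, clue in zip(label_embeddings, clues)]
```

The point of the design is that `extracted` is computed once per image, independently of the label set. An unseen label only needs a text embedding to take part in the sharing step, which is what makes the scheme open-vocabulary.
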

Query-Based Knowledge Sharing for Open-Vocabulary Multi-Label Classification
