Abstract: Zero-shot multi-label classification has recently attracted considerable attention for its ability to make predictions on unseen labels without human annotations. However, prevailing approaches often use seen classes as imperfect proxies for unseen ones, resulting in suboptimal performance. Inspired by the success of text-to-image generation models in producing realistic images, we propose generating synthetic data to construct a training set explicitly tailored for proxyless training on unseen labels. Our approach introduces a novel image generation framework that produces multi-label synthetic images of unseen classes for classifier training. To increase the diversity of the generated images, we use a pre-trained large language model to generate diverse prompts. Using a pre-trained multi-modal CLIP model as a discriminator, we assess whether the generated images accurately represent the target classes. This enables automatic filtering of inaccurately generated images, preserving classifier accuracy. To refine text prompts for more precise and effective multi-label object generation, we introduce a CLIP score-based discriminative loss to fine-tune the text encoder in the diffusion model. Additionally, fine-tuning the entire visual encoder risks catastrophic forgetting; to enhance visual features on the target task while maintaining the generalization of the original features, we propose a feature fusion module inspired by transformer attention mechanisms. This module helps capture global dependencies between multiple objects more effectively. Extensive experimental results validate the effectiveness of our approach, demonstrating significant improvements over state-of-the-art methods.
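
The CLIP-based filtering step described in the abstract can be illustrated with a minimal sketch. It assumes that image and label-text embeddings have already been computed by some CLIP-like encoder and are given as plain lists of floats. The function names (`cosine`, `keep_image`, `filter_generated`), the sample layout, and the threshold value are hypothetical and are not taken from the paper, which does not specify its acceptance rule here. The rule used below, requiring every target label to score above a threshold, is one plausible choice among several.

```python
import math
from typing import Dict, List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def keep_image(
    image_emb: Sequence[float],
    label_embs: Dict[str, Sequence[float]],
    target_labels: Sequence[str],
    threshold: float,
) -> bool:
    """Assumed rule: keep an image only if EVERY target label scores >= threshold."""
    return all(
        cosine(image_emb, label_embs[label]) >= threshold
        for label in target_labels
    )


def filter_generated(
    samples: List[dict],
    label_embs: Dict[str, Sequence[float]],
    threshold: float,
) -> List[dict]:
    """Drop synthetic samples whose embedding does not match all intended labels.

    Each sample is a dict with hypothetical keys:
      "embedding": image embedding (list of floats)
      "labels":    labels the image was prompted to contain
    """
    return [
        s for s in samples
        if keep_image(s["embedding"], label_embs, s["labels"], threshold)
    ]


if __name__ == "__main__":
    # Toy 2-D embeddings standing in for real CLIP features.
    label_embs = {"dog": [1.0, 0.0], "car": [0.0, 1.0]}
    samples = [
        {"id": "a", "embedding": [1.0, 1.0], "labels": ["dog", "car"]},  # both present
        {"id": "b", "embedding": [1.0, 0.0], "labels": ["dog", "car"]},  # car missing
    ]
    kept = filter_generated(samples, label_embs, threshold=0.5)
    print([s["id"] for s in kept])
```

In the toy data, sample "a" has similarity of about 0.71 to both labels and is kept, while sample "b" has zero similarity to "car" and is dropped. With real CLIP features the threshold would need to be tuned, since raw CLIP cosine similarities tend to occupy a much narrower range.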

Heuristic Once Learning for Image & Text Duality Information Processing

Dual Collaborative Visual-Semantic Mapping for Multi-Label Zero-Shot Image Recognition

Multimodal One-Shot Learning of Speech and Images

Learning Paired-associate Images with An Unsupervised Deep Learning Architecture

Dual-stream Multi-Modal Graph Neural Network for Few-Shot Learning

Learning from One and Only One Shot

Deep Multiple Instance Learning for Zero-Shot Image Tagging

Multimodality Helps Unimodality: Cross-Modal Few-Shot Learning with Multimodal Models

Unified Contrastive Learning in Image-Text-Label Space

Multi-Level Semantic Feature Augmentation for One-Shot Learning

One-Shot Learning for Language Modelling

DualCoOp++: Fast and Effective Adaptation to Multi-Label Recognition with Limited Annotations

Unsupervised One-shot Learning of Both Specific Instances and Generalised Classes with a Hippocampal Architecture

DualCoOp: Fast Adaptation to Multi-Label Recognition with Limited Annotations

Multimodal Prototypical Networks for Few-shot Learning

Dual Learning: Theoretical Study and an Algorithmic Extension

Diverse and Tailored Image Generation for Zero-shot Multi-label Classification

Dual-path Convolutional Image-Text Embeddings with Instance Loss

Unsupervised vs. transfer learning for multimodal one-shot matching of speech and images

Few-shot Learning for Multi-Modality Tasks

Hierarchical Vision and Language Transformer for Efficient Visual Dialog