Abstract: Given a query from one modality, few-shot cross-modal retrieval (CMR) retrieves semantically similar instances in another modality with the target domain including classes that are disjoint from the source domain. Compared with classical few-shot CMR methods, vision-language pretraining methods like CLIP have shown great few-shot or zero-shot learning performance. However, they still suffer challenges due to (1) the feature degradation encountered in the target domain and (2) the extreme data imbalance. To tackle these issues, we propose FLEX-CLIP, a novel Feature-level Generation Network Enhanced CLIP. FLEX-CLIP includes two training stages. In multimodal feature generation, we propose a composite multimodal VAE-GAN network to capture real feature distribution patterns and generate pseudo samples based on CLIP features, addressing data imbalance. For common space projection, we develop a gate residual network to fuse CLIP features with projected features, reducing feature degradation in X-shot scenarios. Experimental results on four benchmark datasets show a 7%-15% improvement over state-of-the-art methods, with ablation studies demonstrating enhancement of CLIP features.
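
The retrieval task described in the abstract can be illustrated with a minimal sketch: embed queries from one modality and a gallery from the other, then rank gallery items by cosine similarity. The function names `l2_normalize` and `retrieve` and the toy vectors are hypothetical stand-ins chosen for illustration; they are not CLIP outputs and not the authors' code.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products equal cosine similarity."""
    return x / np.maximum(np.linalg.norm(x, axis=1, keepdims=True), eps)

def retrieve(query_feats, gallery_feats, top_k=3):
    """For each query, return indices of the top_k most similar gallery items."""
    sims = l2_normalize(query_feats) @ l2_normalize(gallery_feats).T
    # Sorting the negated similarities yields descending order of similarity.
    return np.argsort(-sims, axis=1)[:, :top_k]

# Toy embeddings (assumed values): two text queries, three gallery images.
text_queries = np.array([[1.0, 0.0, 0.0],
                         [0.0, 1.0, 0.0]])
image_gallery = np.array([[0.9, 0.1, 0.0],
                          [0.0, 0.2, 0.9],
                          [0.1, 0.8, 0.1]])

ranking = retrieve(text_queries, image_gallery, top_k=2)
print(ranking)  # first query ranks image 0 first; second query ranks image 2 first
```

In the few-shot setting, the difficulty lies in making these similarities meaningful for target-domain classes that were rare or absent during training, not in the ranking step itself.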

What problem does this paper attempt to address?

This paper addresses two main challenges in few-shot cross-modal retrieval (X-shot cross-modal retrieval, CMR):

1. **Feature Degradation in the Target Domain**: Although CLIP-based CMR methods perform well in few-shot learning, mapping features learned on the source domain to the target domain usually degrades them. The model struggles to preserve feature quality during this transfer, so the mapped features perform worse than the original CLIP-extracted features.
2. **Extreme Data Imbalance**: In the few-shot setting, target-domain instances are scarce or missing during training, which makes it difficult to model cross-modal correlations in the target domain. Because the source domain has far more samples than the target domain, training is biased toward the source domain and the model performs poorly on the target domain.

To solve these problems, the authors propose FLEX-CLIP, a CLIP model enhanced by a feature-level generation network, which improves few-shot cross-modal retrieval through two key stages:

- **Multimodal Feature Generation**: A composite multimodal generation architecture combines the advantages of a variational auto-encoder (VAE) and a generative adversarial network (GAN) to generate pseudo-samples that alleviate the data imbalance problem. The GAN generates pseudo-samples conditioned on class embeddings, while the VAE captures feature distribution patterns by encoding and reconstructing real samples.
- **Common Space Projection**: A gated residual network selectively fuses the original features with the mapped features, making full use of the semantic information in CLIP and significantly reducing feature degradation.

Through these two stages, FLEX-CLIP addresses the feature degradation and data imbalance problems in few-shot cross-modal retrieval and achieves significant performance improvements on multiple benchmark datasets.
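
The two stages described above can be sketched, forward pass only, in a few lines of NumPy. This is an illustrative sketch under stated assumptions, not the authors' implementation: the weights are random and untrained, the VAE encoder and the GAN discriminator are omitted, and the sigmoid gate over concatenated features is one common gating form that may differ from the paper's exact formulation. All names (`generate_pseudo_features`, `gated_residual_fusion`, `w_gen`, `w_proj`, `w_gate`) and dimensions are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def generate_pseudo_features(class_emb, w_gen, n_samples, noise_dim, rng):
    """Conditional generator: map [noise ; class embedding] to pseudo features."""
    noise = rng.standard_normal((n_samples, noise_dim))
    cond = np.repeat(class_emb[None, :], n_samples, axis=0)
    return np.tanh(np.concatenate([noise, cond], axis=1) @ w_gen)

def gated_residual_fusion(clip_feat, w_proj, w_gate):
    """Blend original features with projected ones through an element-wise gate."""
    projected = np.tanh(clip_feat @ w_proj)
    gate = sigmoid(np.concatenate([clip_feat, projected], axis=1) @ w_gate)
    # gate near 1 trusts the projection; gate near 0 falls back to the input feature.
    return gate * projected + (1.0 - gate) * clip_feat

rng = np.random.default_rng(0)
feat_dim, emb_dim, noise_dim = 8, 8, 4  # assumed toy sizes

# Random, untrained weights standing in for learned parameters.
w_gen = rng.standard_normal((noise_dim + emb_dim, feat_dim)) * 0.1
w_proj = rng.standard_normal((feat_dim, feat_dim)) * 0.1
w_gate = rng.standard_normal((2 * feat_dim, feat_dim)) * 0.1

# Stage 1: pseudo features for one target-domain class embedding.
class_emb = rng.standard_normal(emb_dim)
pseudo = generate_pseudo_features(class_emb, w_gen, n_samples=5,
                                  noise_dim=noise_dim, rng=rng)

# Stage 2: fuse input features with their common-space projection.
fused = gated_residual_fusion(pseudo, w_proj, w_gate)
print(pseudo.shape, fused.shape)
```

Because the gate lies strictly between 0 and 1, each fused value is an element-wise interpolation between the input feature and its projection. This is the sense in which a gated residual path can limit how far the projection drifts from the original CLIP features.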

FLEX-CLIP: Feature-Level GEneration Network Enhanced CLIP for X-shot Cross-modal Retrieval
