Abstract: Most existing studies on few-shot learning focus on unimodal settings, where models are trained to generalize to unseen data using only a small number of labeled examples from the same modality. However, real-world data are inherently multi-modal, and unimodal approaches limit the practical applications of few-shot learning. To address this gap, this paper introduces the Cross-modal Few-Shot Learning (CFSL) task, which aims to recognize instances from multiple modalities when only a few labeled examples are available. This task presents additional challenges compared to classical few-shot learning because each modality has its own distinct visual characteristics and structural properties. To tackle these challenges, we propose a Generative Transfer Learning (GTL) framework consisting of two stages: the first stage trains on abundant unimodal data, and the second stage uses transfer learning to adapt to novel data. Our GTL framework jointly estimates the latent shared concept across modalities and the in-modality disturbance in both stages, and it freezes the generative module during the transfer phase to keep the learned representations stable and prevent overfitting to the limited multi-modal samples. Our findings demonstrate that GTL outperforms state-of-the-art methods across four distinct multi-modal datasets: Sketchy, TU-Berlin, Mask1K, and SKSF-A. Additionally, the results suggest that the model can estimate latent concepts from vast unimodal data and generalize these concepts to unseen modalities using only a limited number of available samples, much like human cognitive processes.
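The two-stage schedule described in the abstract can be illustrated with a deliberately small toy. This is not the GTL model: as a stand-in, it uses class-mean prototypes for the shared concept and a single offset vector for the in-modality disturbance. All names (`fit_prototypes`, `fit_offset`, `sketch_shift`) and the synthetic 2-D data are assumptions made for illustration. The sketch only shows the pattern of learning a shared component on abundant data, then freezing it and fitting a small modality-specific component from a few labeled samples.

```python
import random

def mean(vectors):
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

def sub(a, b):
    return [x - y for x, y in zip(a, b)]

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def fit_prototypes(samples):
    """Stage 1 (toy): one shared 'concept' vector per class, from abundant data."""
    by_class = {}
    for x, y in samples:
        by_class.setdefault(y, []).append(x)
    return {y: mean(xs) for y, xs in by_class.items()}

def fit_offset(prototypes, few_shot):
    """Stage 2 (toy): prototypes are read-only here; only the offset is estimated."""
    return mean([sub(x, prototypes[y]) for x, y in few_shot])

def predict(prototypes, offset, x):
    z = sub(x, offset)  # remove the estimated modality disturbance
    return min(prototypes, key=lambda y: sq_dist(z, prototypes[y]))

def sample(rng, center, shift, noise=0.1):
    return [c + s + rng.gauss(0.0, noise) for c, s in zip(center, shift)]

rng = random.Random(0)
# Hypothetical synthetic setup: three classes in 2-D, one shifted "sketch" modality.
true_concepts = {"cat": [0.0, 0.0], "dog": [2.0, 0.0], "car": [0.0, 2.0]}
no_shift = [0.0, 0.0]
sketch_shift = [5.0, -3.0]

# Stage 1: abundant data from a single source modality.
source = [(sample(rng, c, no_shift), y)
          for y, c in true_concepts.items() for _ in range(200)]
prototypes = fit_prototypes(source)
frozen_copy = {y: list(p) for y, p in prototypes.items()}

# Stage 2: one labeled example per class from the new modality.
few_shot = [(sample(rng, c, sketch_shift), y) for y, c in true_concepts.items()]
offset = fit_offset(prototypes, few_shot)

# Evaluate on unseen samples from the new modality, with and without the offset.
queries = [(sample(rng, c, sketch_shift), y)
           for y, c in true_concepts.items() for _ in range(50)]
accuracy = sum(predict(prototypes, offset, x) == y for x, y in queries) / len(queries)
baseline = sum(predict(prototypes, no_shift, x) == y for x, y in queries) / len(queries)
print("with offset:", accuracy, "without offset:", baseline)
```

Because stage 2 fits only the offset, the shared prototypes cannot drift toward the handful of new-modality samples. This is the role the abstract assigns to freezing the generative module, although the actual model is considerably richer than this toy.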
