Cross-Modal Few-Shot Learning: a Generative Transfer Learning Framework

Zhengwei Yang, Yuke Li, Qiang Sun, Basura Fernando, Heng Huang, Zheng Wang

2024-10-15

Abstract: Most existing studies on few-shot learning focus on unimodal settings, where models are trained to generalize on unseen data using only a small number of labeled examples from the same modality. However, real-world data are inherently multi-modal, and unimodal approaches limit the practical applications of few-shot learning. To address this gap, this paper introduces the Cross-modal Few-Shot Learning (CFSL) task, which aims to recognize instances from multiple modalities when only a few labeled examples are available. This task presents additional challenges compared to classical few-shot learning due to the distinct visual characteristics and structural properties unique to each modality. To tackle these challenges, we propose a Generative Transfer Learning (GTL) framework consisting of two stages: the first stage involves training on abundant unimodal data, and the second stage focuses on transfer learning to adapt to novel data. Our GTL framework jointly estimates the latent shared concept across modalities and in-modality disturbance in both stages, while freezing the generative module during the transfer phase to maintain the stability of the learned representations and prevent overfitting to the limited multi-modal samples. Our findings demonstrate that GTL has superior performance compared to state-of-the-art methods across four distinct multi-modal datasets: Sketchy, TU-Berlin, Mask1K, and SKSF-A. Additionally, the results suggest that the model can estimate latent concepts from vast unimodal data and generalize these concepts to unseen modalities using only a limited number of available samples, much like human cognitive processes.

Computer Vision and Pattern Recognition, Machine Learning

What problem does this paper attempt to address?

This paper addresses few-shot learning (FSL) on multimodal data. Most existing few-shot methods assume a single-modality setting, in which the model generalizes to unseen data from only a small number of labeled samples of the same modality. Real-world data, however, is often multimodal, for example images, video, and audio, or data collected with different sensors or imaging protocols, so single-modality methods are limited in practice.

To bridge this gap, the paper introduces the Cross-modal Few-Shot Learning (CFSL) task, which aims to recognize instances from multiple modalities when only a few labeled samples are available. The task is harder than traditional few-shot learning because each modality has its own visual characteristics and structural properties, which complicates feature extraction and alignment.

To address these challenges, the authors propose a Generative Transfer Learning (GTL) framework with two stages:

1. **Generative Learning Stage**: Training on a large amount of single-modality data, focusing on capturing the inherent concepts and the variations in visual content within that modality.
2. **Recognition Stage**: Transfer learning on a small amount of multimodal data, with the generative module frozen to keep the learned representations stable and to prevent overfitting to the limited multimodal samples.

By jointly estimating the latent concepts shared across modalities and the perturbations within each modality, GTL transfers what it learns from single-modality data to multimodal data. Experimental results show that the method outperforms existing methods on four multimodal datasets (Sketchy, TU-Berlin, Mask1K, and SKSF-A).
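The "train once, then freeze and adapt" pattern behind the two stages can be illustrated with a deliberately small sketch. This is not the GTL model: the paper's generative module is replaced here by a hypothetical `FrozenEncoder` that only standardizes features, and the recognition stage is replaced by a nearest-prototype classifier. All names and the toy 2-D data are assumptions made for illustration.

```python
from statistics import mean, pstdev


class FrozenEncoder:
    """Stand-in for a stage-1 module: per-feature standardization.

    The statistics are fitted once on abundant unimodal data and are never
    updated afterwards, which mimics freezing a pretrained module.
    """

    def __init__(self, data):
        columns = list(zip(*data))
        self.mu = [mean(c) for c in columns]
        # Fall back to 1.0 for constant features to avoid division by zero.
        self.sigma = [pstdev(c) or 1.0 for c in columns]

    def __call__(self, x):
        return [(v - m) / s for v, m, s in zip(x, self.mu, self.sigma)]


def fit_prototypes(encoder, support):
    """Stage 2 stand-in: average the encoded few-shot samples per class.

    Only the prototypes are estimated here; the encoder is left untouched.
    """
    by_class = {}
    for x, y in support:
        by_class.setdefault(y, []).append(encoder(x))
    return {y: [mean(c) for c in zip(*zs)] for y, zs in by_class.items()}


def predict(encoder, prototypes, x):
    """Return the label of the prototype closest to the encoded query."""
    z = encoder(x)

    def sq_dist(p):
        return sum((a - b) ** 2 for a, b in zip(z, p))

    return min(prototypes, key=lambda y: sq_dist(prototypes[y]))


# Stage 1: abundant unimodal data (toy, hypothetical feature vectors).
unimodal = [[0.0, 0.0], [1.0, 2.0], [2.0, 4.0], [3.0, 6.0]]
encoder = FrozenEncoder(unimodal)

# Stage 2: one labeled sample per class from a new modality (1-shot).
support = [([0.0, 0.0], "cat"), ([3.0, 6.0], "dog")]
prototypes = fit_prototypes(encoder, support)

print(predict(encoder, prototypes, [2.5, 5.0]))
```

Because the second stage estimates only one prototype per class, a handful of samples has few parameters to overfit. The summary gives the same reason for freezing the generative module during transfer.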

Adaptive Cross-Modal Few-Shot Learning

Cross Modal Few-Shot Contextual Transfer for Heterogenous Image Classification

Knowledge Graph Enhanced Multimodal Learning for Few-shot Visual Recognition

Knowledge Transduction for Cross-Domain Few-Shot Learning

GCT: Graph Co-Training for Semi-Supervised Few-Shot Learning

Multimodal Prototypical Networks for Few-shot Learning

Multimodal Few-Shot Learning with Frozen Language Models

Cross-Domain Cross-Set Few-Shot Learning via Learning Compact and Aligned Representations

A Transferable Generative Framework for Multi-Label Zero-Shot Learning

Multi-directional Knowledge Transfer for Few-Shot Learning

Task Context Transformer and GCN for Few-Shot Learning of Cross-Domain

Multimodality Helps Unimodality: Cross-Modal Few-Shot Learning with Multimodal Models

Dual-stream Multi-Modal Graph Neural Network for Few-Shot Learning

Cross-domain self-supervised few-shot learning via multiple crops with teacher-student network

Enhancing few-shot lifelong learning through fusion of cross-domain knowledge

Improving Cross-domain Few-shot Classification with Multilayer Perceptron

A Closer Look at Few-Shot Crosslingual Transfer: The Choice of Shots Matters

A conditional GAN-based approach for enhancing transfer learning performance in few-shot HCR tasks

Cross-Domain Few-Shot Classification via Learned Feature-Wise Transformation

Dual Adaptive Representation Alignment for Cross-domain Few-shot Learning