Abstract: The ability to quickly learn a new task with minimal instruction - known as few-shot learning - is a central aspect of intelligent agents. Classical few-shot benchmarks make use of few-shot samples from a single modality, but such samples may not be sufficient to characterize an entire concept class. In contrast, humans use cross-modal information to learn new concepts efficiently. In this work, we demonstrate that one can indeed build a better **visual** dog classifier by **read**ing about dogs and **listen**ing to them bark. To do so, we exploit the fact that recent multimodal foundation models such as CLIP learn cross-modal encoders that map different modalities to the same representation space. Specifically, we propose a simple strategy for **cross-modal adaptation**: we treat examples from different modalities as additional few-shot examples. For example, by simply repurposing class names as an additional training sample, we trivially turn any n-shot learning problem into an (n+1)-shot problem. This allows us to produce SOTA results with embarrassingly simple linear classifiers. We show that our approach can be combined with existing methods such as prefix tuning, adapters, and classifier ensembling. Finally, to explore other modalities beyond vision and language, we construct the first (to our knowledge) audiovisual few-shot benchmark and use cross-modal training to improve the performance of both image and audio classification.

What problem does this paper attempt to address?

### Problems the Paper Aims to Solve

This paper aims to address the issue of modality insufficiency in few-shot learning. Traditional few-shot learning methods typically utilize data from a single modality (such as images), which may not be sufficient to comprehensively describe a concept class. In contrast, humans can leverage cross-modal information (such as visual, auditory, and linguistic) when learning new concepts. Therefore, this paper proposes a cross-modal few-shot learning method that improves the performance of single-modal tasks by combining data from different modalities.

Specifically, the authors demonstrate that by reading textual descriptions about dogs and listening to dog barks, a better visual dog classifier can be constructed. They utilize the cross-modal encoders of recent multimodal foundation models (such as CLIP), which can map different modalities into the same representation space. By using data from different modalities as additional few-shot examples, the authors propose a simple cross-modal adaptation strategy, transforming the "n-shot" problem into an "(n+1)-shot" problem. This method not only improves the performance of the classifier but also further enhances the effectiveness of existing few-shot learning methods.

### Main Contributions

1. **Cross-Modal Few-Shot Learning**: A simple yet effective cross-modal adaptation strategy is proposed, which improves the performance of single-modal tasks by using data from different modalities as additional training samples.
2. **Multimodal Benchmark Dataset**: The first cross-modal few-shot learning benchmark dataset (ImageNet-ESC) containing audio and images is constructed to evaluate the effectiveness of cross-modal learning.
3. **Performance Improvement**: Experimental results show that this method achieves state-of-the-art performance in various few-shot learning tasks, especially in low-data scenarios (1-shot and 2-shot).

### Experimental Validation

- **Visual-Language Adaptation**: Experiments were conducted on multiple image classification datasets, and the results show that the cross-modal linear probe outperforms existing methods in all few-shot settings.
- **Visual-Audio Adaptation**: Experiments were conducted on the constructed ImageNet-ESC dataset, verifying that listening to dog barks can improve the performance of the visual dog classifier.

### Conclusion

By introducing cross-modal information, this paper successfully addresses the issue of modality insufficiency in traditional few-shot learning, providing new ideas and methods for few-shot learning. This method not only performs well in visual tasks but can also be extended to tasks of other modalities, showing broad application prospects.
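The "n-shot to (n+1)-shot" idea summarized above can be illustrated with a minimal sketch. It is not the authors' implementation: synthetic vectors stand in for CLIP image and text features (an assumption, so that the example runs without any model download), and the helper names `build_cross_modal_set`, `train_linear_probe`, and `predict` are hypothetical. The only point it tries to show is that, once both modalities live in one embedding space, a class-name embedding can be appended to the image shots and a plain linear classifier trained on the union.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length, as is common for CLIP-style features."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)


def build_cross_modal_set(image_feats, image_labels, text_feats):
    """Append one text feature per class to the image few-shot set.

    Row c of text_feats is assumed to be the class-name embedding of class c.
    """
    num_classes = text_feats.shape[0]
    feats = np.concatenate([image_feats, text_feats], axis=0)
    labels = np.concatenate([image_labels, np.arange(num_classes)], axis=0)
    return l2_normalize(feats), labels


def train_linear_probe(feats, labels, num_classes, lr=1.0, steps=200):
    """Softmax regression trained with full-batch gradient descent."""
    n, dim = feats.shape
    weights = np.zeros((num_classes, dim))
    one_hot = np.eye(num_classes)[labels]
    for _ in range(steps):
        logits = feats @ weights.T
        logits -= logits.max(axis=1, keepdims=True)  # numerical stability
        probs = np.exp(logits)
        probs /= probs.sum(axis=1, keepdims=True)
        grad = (probs - one_hot).T @ feats / n  # gradient of mean cross-entropy
        weights -= lr * grad
    return weights


def predict(weights, feats):
    """Return the highest-scoring class index for each feature row."""
    return (l2_normalize(feats) @ weights.T).argmax(axis=1)


# Synthetic stand-ins for features in a shared embedding space (assumption):
# each class has a prototype; image and text features are noisy copies of it.
rng = np.random.default_rng(0)
num_classes, n_shot, dim = 3, 2, 16
prototypes = np.eye(num_classes, dim)

image_labels = np.repeat(np.arange(num_classes), n_shot)
image_feats = prototypes[image_labels] + 0.05 * rng.normal(size=(num_classes * n_shot, dim))
text_feats = prototypes + 0.05 * rng.normal(size=(num_classes, dim))

# n image shots per class plus one text sample per class: (n+1) shots each.
train_feats, train_labels = build_cross_modal_set(image_feats, image_labels, text_feats)
weights = train_linear_probe(train_feats, train_labels, num_classes)

# Classify held-out "image" queries drawn around the same prototypes.
query_labels = np.arange(num_classes)
query_feats = prototypes + 0.05 * rng.normal(size=(num_classes, dim))
query_preds = predict(weights, query_feats)
```

With real data, the synthetic arrays would be replaced by features from the image and text encoders of a multimodal model; the rest of the sketch does not depend on which modality a training row came from, which is the property the summary attributes to the method.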

Multimodality Helps Unimodality: Cross-Modal Few-Shot Learning with Multimodal Models
