Multimodality Helps Unimodality: Cross-Modal Few-Shot Learning with Multimodal Models

Zhiqiu Lin, Samuel Yu, Zhiyi Kuang, Deepak Pathak, Deva Ramanan
2024-08-28
Abstract: The ability to quickly learn a new task with minimal instruction - known as few-shot learning - is a central aspect of intelligent agents. Classical few-shot benchmarks make use of few-shot samples from a single modality, but such samples may not be sufficient to characterize an entire concept class. In contrast, humans use cross-modal information to learn new concepts efficiently. In this work, we demonstrate that one can indeed build a better **visual** dog classifier by **read**ing about dogs and **listen**ing to them bark. To do so, we exploit the fact that recent multimodal foundation models such as CLIP learn cross-modal encoders that map different modalities to the same representation space. Specifically, we propose a simple strategy for **cross-modal adaptation**: we treat examples from different modalities as additional few-shot examples. For example, by simply repurposing class names as an additional training sample, we trivially turn any n-shot learning problem into a (n+1)-shot problem. This allows us to produce SOTA results with embarrassingly simple linear classifiers. We show that our approach can be combined with existing methods such as prefix tuning, adapters, and classifier ensembling. Finally, to explore other modalities beyond vision and language, we construct the first (to our knowledge) audiovisual few-shot benchmark and use cross-modal training to improve the performance of both image and audio classification.
Computer Vision and Pattern Recognition, Artificial Intelligence, Machine Learning, Sound, Audio and Speech Processing
What problem does this paper attempt to address?
### Problems the Paper Aims to Solve

This paper addresses the limits of single-modality data in few-shot learning. Traditional few-shot learning methods typically use samples from a single modality (such as images), which may not be sufficient to characterize an entire concept class. In contrast, humans leverage cross-modal information (visual, auditory, and linguistic) when learning new concepts. The paper therefore proposes a cross-modal few-shot learning method that improves performance on single-modal tasks by combining data from different modalities.

Specifically, the authors demonstrate that a better visual dog classifier can be built by reading textual descriptions of dogs and listening to dog barks. They exploit the cross-modal encoders of recent multimodal foundation models (such as CLIP), which map different modalities into the same representation space. Treating examples from other modalities as additional few-shot examples yields a simple cross-modal adaptation strategy: for instance, repurposing each class name as an extra training sample turns an "n-shot" problem into an "(n+1)-shot" problem. The approach improves classifier performance on its own and can also be combined with existing few-shot methods such as prefix tuning, adapters, and classifier ensembling.

### Main Contributions

1. **Cross-Modal Few-Shot Learning**: A simple yet effective cross-modal adaptation strategy is proposed, which improves the performance of single-modal tasks by using data from different modalities as additional training samples.
2. **Multimodal Benchmark Dataset**: ImageNet-ESC, which the authors describe as the first (to their knowledge) audiovisual few-shot benchmark, is constructed to evaluate cross-modal learning with images and audio.
3. **Performance Improvement**: Experimental results show that the method achieves state-of-the-art performance with simple linear classifiers across few-shot learning tasks, especially in low-data scenarios (1-shot and 2-shot).
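The cross-modal adaptation idea summarized above can be illustrated with a minimal sketch. It is not the authors' implementation: the features below are synthetic stand-ins for encoder outputs (no real CLIP model is loaded), and the helper names `add_text_shots`, `train_linear_probe`, and `predict` are hypothetical. The sketch assumes image and text features already live in one shared, L2-normalized space, appends one text feature per class to an n-shot image set, and fits a plain softmax linear classifier on the combined set.

```python
import numpy as np


def l2_normalize(x):
    """Scale each row to unit length (a common convention for CLIP-style features)."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)


def add_text_shots(image_feats, image_labels, text_feats):
    """Append one text feature per class to the image few-shot set.

    text_feats[c] is assumed to be the embedding of class c's name, so an
    n-shot image set becomes an (n+1)-shot cross-modal set.
    """
    num_classes = text_feats.shape[0]
    feats = np.concatenate([image_feats, text_feats], axis=0)
    labels = np.concatenate([image_labels, np.arange(num_classes)])
    return feats, labels


def train_linear_probe(feats, labels, num_classes, lr=1.0, steps=500):
    """Fit a softmax linear classifier by full-batch gradient descent."""
    weights = np.zeros((num_classes, feats.shape[1]))
    one_hot = np.eye(num_classes)[labels]
    for _ in range(steps):
        logits = feats @ weights.T
        logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
        probs = np.exp(logits)
        probs = probs / probs.sum(axis=1, keepdims=True)
        grad = (probs - one_hot).T @ feats / feats.shape[0]  # cross-entropy gradient
        weights = weights - lr * grad
    return weights


def predict(weights, feats):
    """Return the highest-scoring class index for each feature row."""
    return np.argmax(feats @ weights.T, axis=1)


# Synthetic stand-ins for encoder outputs (assumption: NOT real CLIP features).
rng = np.random.default_rng(0)
num_classes, dim, n_shot = 3, 16, 1
prototypes = l2_normalize(rng.normal(size=(num_classes, dim)))
image_labels = np.repeat(np.arange(num_classes), n_shot)
image_noise = 0.05 * rng.normal(size=(num_classes * n_shot, dim))
image_feats = l2_normalize(prototypes[image_labels] + image_noise)
text_feats = l2_normalize(prototypes + 0.05 * rng.normal(size=(num_classes, dim)))

# 1-shot images plus one class-name feature per class gives a 2-shot training set.
feats, labels = add_text_shots(image_feats, image_labels, text_feats)
weights = train_linear_probe(feats, labels, num_classes)
train_accuracy = float(np.mean(predict(weights, feats) == labels))
print(feats.shape, train_accuracy)
```

With a real model, `image_feats` and `text_feats` would come from the image and text encoders; the rest of the pipeline stays a single linear layer, which is what makes the strategy cheap to combine with other adaptation methods.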
### Experimental Validation

- **Visual-Language Adaptation**: Experiments were conducted on multiple image classification datasets. The reported results show the cross-modal linear probe outperforming existing methods across the few-shot settings tested.
- **Visual-Audio Adaptation**: Experiments were conducted on the constructed ImageNet-ESC benchmark. They indicate that cross-modal training improves both image and audio classification; for example, listening to dog barks improves the visual dog classifier.

### Conclusion

By introducing cross-modal information, this paper mitigates the limits of single-modality data in traditional few-shot learning and offers a simple new approach to the problem. The method performs well on visual tasks and can also be extended to other modalities such as audio.