Less is More: A Closer Look at Semantic-based Few-Shot Learning

Chunpeng Zhou, Haishuai Wang, Xilu Yuan, Zhi Yu, Jiajun Bu
2024-03-24
Abstract: Few-shot learning aims to learn and distinguish new categories from a very limited number of available images, which presents a significant challenge in deep learning. Recent work has sought to leverage additional textual or linguistic information about these rare categories with a pre-trained language model to facilitate learning, thus partially alleviating the problem of insufficient supervision signals. However, the full potential of the textual information and the pre-trained language model has so far been underestimated in few-shot learning, resulting in limited performance gains. To address this, we propose a simple but effective framework for few-shot learning tasks, specifically designed to exploit the textual information and the language model. In more detail, we explicitly exploit the zero-shot capability of the pre-trained language model with a learnable prompt, and we directly add the visual feature to the textual feature for inference, without the intricately designed fusion modules of previous works. Additionally, we apply self-ensemble and distillation to further enhance these components. Extensive experiments across four widely used few-shot datasets demonstrate that our simple framework achieves impressive results. Particularly noteworthy is its performance in the 1-shot learning task, where it surpasses state-of-the-art methods by an average of 3.0% in classification accuracy. We will make the source code of the proposed framework publicly available upon acceptance.
Computer Vision and Pattern Recognition, Artificial Intelligence
What problem does this paper attempt to address?
### Problems Addressed by the Paper

This paper addresses the key challenge in **Few-shot Learning (FSL)**: how to recognize new categories from only a few samples. Specifically:

1. **Utilizing Semantic Information**:
   - Existing FSL methods introduce additional semantic or linguistic information through pre-trained language models, but they often rely on complex multimodal fusion modules and so fail to exploit the strong generalization capabilities of those language models.
   - This paper proposes a simple and effective framework that directly leverages the zero-shot capability of pre-trained language models and improves generalization through learnable prompts.
2. **Simplifying Model Structure**:
   - Unlike previous complex multimodal fusion mechanisms, the proposed method directly adds visual features to textual semantic features, which avoids the negative impact of complex structures on the generalization ability of pre-trained language models.
   - Additionally, the framework employs self-ensemble and self-distillation techniques to further improve performance.
3. **Experimental Validation**:
   - Extensive experiments were conducted on four widely used FSL datasets. The results show that the method surpasses state-of-the-art methods by an average of 3.0% in classification accuracy on 1-shot learning tasks.

In summary, this paper addresses the issue of insufficient supervision signals in few-shot learning by simplifying the model structure, making fuller use of the capabilities of pre-trained language models, and applying self-ensemble and self-distillation techniques.
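
The direct addition of visual and textual features described above can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the L2 normalization before addition, the mean-of-support prototype, and the cosine-similarity classifier are all assumptions made for illustration, and the toy vectors stand in for features that would come from a visual backbone and a prompted language model.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (assumed here so both modalities weigh equally)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def cosine(a, b):
    """Cosine similarity between two vectors."""
    a, b = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a, b))


def mean_vector(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def fused_prototypes(support_feats, text_feats):
    """Build one prototype per class by adding visual and textual features.

    support_feats: {class_name: [visual feature vectors of the support images]}
    text_feats:    {class_name: textual feature vector of the class name}
    Both inputs are hypothetical placeholders for real encoder outputs.
    """
    protos = {}
    for cls, feats in support_feats.items():
        visual = l2_normalize(mean_vector(feats))
        textual = l2_normalize(text_feats[cls])
        # Fusion is a plain element-wise sum, with no learned fusion module.
        protos[cls] = [v + t for v, t in zip(visual, textual)]
    return protos


def classify(query_feat, protos):
    """Assign the query to the class whose fused prototype is most similar."""
    return max(protos, key=lambda cls: cosine(query_feat, protos[cls]))


# Toy 2-way 1-shot episode with made-up 3-dimensional features.
support = {"cat": [[0.9, 0.1, 0.0]], "dog": [[0.1, 0.9, 0.0]]}
text = {"cat": [1.0, 0.0, 0.2], "dog": [0.0, 1.0, 0.2]}
protos = fused_prototypes(support, text)
prediction = classify([0.8, 0.3, 0.1], protos)
print(prediction)
```

In the 1-shot setting each visual prototype rests on a single image, so the added textual feature acts as a second, independent source of evidence for the class. The sketch leaves out the learnable prompts, self-ensemble, and self-distillation mentioned in the summary.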