CoCoOpter: Pre-train, prompt, and fine-tune the vision-language model for few-shot image classification
Jie Yan, Yuxiang Xie, Yanming Guo, Yingmei Wei, Xiaoping Zhang, Xidao Luan
DOI: https://doi.org/10.1007/s13735-023-00286-5
2023-08-24
International Journal of Multimedia Information Retrieval
Abstract: Few-shot image classification aims to generalize to unseen categories from only a few training samples. Transfer learning is a prominent approach to the task: it first learns a backbone on the base classes and then trains a classifier on the new classes using the previously learned knowledge. The convolutional neural network (CNN) is typically the preferred backbone. However, when samples are limited, the representational ability of CNN-extracted features decreases, which degrades few-shot image classification performance. Recently, pre-trained large-scale vision-language models such as CLIP have shown non-trivial potential; with prompts, they can serve as a backbone for zero-shot or few-shot transfer on a range of downstream tasks. To fully explore the few-shot image classification performance of vision-language models, we propose CoCoOpter, a novel "pre-training + prompt tuning + fine-tuning" paradigm based on CLIP. CoCoOpter alleviates overfitting and preserves generalizability to unseen categories. Specifically, it learns an input-specific neural network that relieves overfitting by shifting attention from a specific category to each specific input sample. Then, to connect the visual and textual signals, it feeds the previously learned visual representations into automatic prompt tuning in the middle of the pre-trained CLIP, which enables learning input-specific prompt vectors. Moreover, two learnable lightweight neural networks are added at the end of CLIP to guide information propagation between different classes by fine-tuning both the visual and textual features. We perform extensive experiments on 11 image classification datasets. The results show that CoCoOpter generalizes better to unseen classes and achieves superior few-shot classification performance with a straightforward design.
computer science, artificial intelligence, software engineering
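
The data flow described in the abstract (an input-specific network that conditions the prompt vectors on the image feature, followed by two lightweight networks that refine the visual and textual features) can be illustrated with a toy sketch. This is not the authors' implementation: the abstract gives no architectural details, so the layer sizes, the residual blend `ratio`, the `temperature`, the random weights, and in particular `toy_text_encoder` (a stand-in for CLIP's frozen text encoder) are all assumptions chosen only to make the example self-contained.

```python
import math
import random

def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def relu(v):
    return [max(0.0, a) for a in v]

def normalize(v):
    """Scale v to unit L2 norm (left unchanged if the norm is zero)."""
    norm = math.sqrt(sum(a * a for a in v)) or 1.0
    return [a / norm for a in v]

def rand_matrix(rows, cols, rng):
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]

def meta_net(image_feat, W1, W2):
    """Input-specific network: maps an image feature to a prompt shift."""
    return matvec(W2, relu(matvec(W1, image_feat)))

def adapter(feat, W1, W2, ratio):
    """Lightweight bottleneck network blended residually with its input.
    The residual blend is an assumption, not taken from the abstract."""
    adapted = relu(matvec(W2, relu(matvec(W1, feat))))
    return [ratio * a + (1.0 - ratio) * f for a, f in zip(adapted, feat)]

def toy_text_encoder(prompt_tokens, class_emb):
    """Hypothetical stand-in for the frozen CLIP text encoder:
    averages the prompt tokens together with the class-name embedding."""
    tokens = prompt_tokens + [class_emb]
    return [sum(tok[i] for tok in tokens) / len(tokens)
            for i in range(len(class_emb))]

def init_params(dim=8, hidden=4, n_ctx=4, n_classes=3, seed=0):
    """Random toy parameters; a real model would learn these."""
    rng = random.Random(seed)
    params = {"ctx": rand_matrix(n_ctx, dim, rng),
              "class_embs": rand_matrix(n_classes, dim, rng)}
    for name in ("meta", "vis", "txt"):
        params[name + "_W1"] = rand_matrix(hidden, dim, rng)
        params[name + "_W2"] = rand_matrix(dim, hidden, rng)
    return params

def classify(image_feat, p, ratio=0.2, temperature=0.01):
    """Return class probabilities for one (pre-extracted) image feature."""
    # 1. Condition the shared context vectors on this particular input.
    shift = meta_net(image_feat, p["meta_W1"], p["meta_W2"])
    prompt = [[c + s for c, s in zip(tok, shift)] for tok in p["ctx"]]
    # 2. Refine the visual feature with its own lightweight network.
    img = normalize(adapter(image_feat, p["vis_W1"], p["vis_W2"], ratio))
    # 3. Encode one prompt per class, refine it, and score by cosine similarity.
    logits = []
    for class_emb in p["class_embs"]:
        txt = toy_text_encoder(prompt, class_emb)
        txt = normalize(adapter(txt, p["txt_W1"], p["txt_W2"], ratio))
        logits.append(sum(a * b for a, b in zip(img, txt)) / temperature)
    # 4. Numerically stable softmax over the class logits.
    peak = max(logits)
    exps = [math.exp(l - peak) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]
```

Calling `classify(feature, init_params())` with an 8-dimensional feature returns one probability per toy class. In an actual system the image feature and text encoding would come from the pre-trained CLIP encoders, and only the context vectors, the input-specific network, and the two lightweight networks would be trained.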