Fine-Grained Visual Prompt Learning of Vision-Language Models for Image Recognition

Hongbo Sun,Xiangteng He,Jiahuan Zhou,Yuxin Peng
DOI: https://doi.org/10.1145/3581783.3612403
2023-01-01
Abstract:Large-scale pre-trained vision-language (VL) models have shown powerful generic representation capabilities for adapting to downstream tasks with limited training data, which are data-efficient solutions to various applications such as image recognition. In order to enhance the adaption performance, most existing methods attempt to introduce learnable vectors into the text prompt to generate adaptive classification weights for the class in the downstream task. However, they generally focus on the text side while neglecting adaptive visual feature generation on the image side, which is insufficient to fit the downstream task data. In this paper, we propose fine-grained visual prompt learning (FG-VPL) of vision-language models for image recognition with few training samples, and the main contributions are: (1) Fine-grained visual prompt is introduced into the image encoder of the vision-language model for focusing on the target object and conducting information interaction within the object, which facilitates generating discriminative visual features for image recognition. (2) A two-pathway adaptive recognition module is proposed to narrow the domain gap and utilize both the cross-modal knowledge of the vision-language model and the visual information of the few-sample training set for classifying images with the help of feature adapters. We conduct extensive experiments on 11 image recognition benchmark datasets under the few training samples setting, which demonstrate that our proposed approach can achieve state-of-the-art performance. The code is available at https://github.com/PKU-ICST-MIPL/FG-VPL_ACMMM2023.
What problem does this paper attempt to address?