Region Attention Fine-tuning with CLIP for Few-shot Classification

Guangxing Wu, Junxi Chen, Qiu Li, Wentao Zhang, Wei-Shi Zheng, Ruixuan Wang
DOI: https://doi.org/10.1109/icme57554.2024.10688204
2024-01-01
Abstract: With the advancement of vision-language models such as CLIP and their strong zero-shot recognition performance, numerous CLIP-based methods have emerged for few-shot classification. However, many of them do not fully exploit the rich feature information within the CLIP visual encoder, and they overlook the fact that the importance of individual image regions for classification varies across datasets. To address these limitations, we present an attention pooling-based framework for few-shot fine-tuning. Our framework enables the model to learn task-specific attention weights for image regions, while also incorporating background features and a consistency constraint to enhance training. Our approach outperforms state-of-the-art approaches on 11 benchmarks, demonstrating its effectiveness.
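To make the idea of learned attention weights over image regions concrete, the following is a minimal, dependency-free sketch of attention pooling followed by similarity-based classification. It is an illustration under stated assumptions, not the authors' implementation: the function names (`attention_pool`, `classify`), the single learnable `query` vector, the scaled dot-product scoring, and the toy 2-dimensional features are all hypothetical, and the background features and consistency constraint mentioned in the abstract are not modeled.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(region_feats, query):
    """Pool N region features into one vector using attention weights.

    region_feats: list of N feature vectors of length D, standing in for
                  region (patch) features from a visual encoder.
    query:        hypothetical learnable vector of length D; during
                  fine-tuning it would be updated by gradient descent.
    """
    d = len(query)
    # Scaled dot-product score of each region against the query (assumption).
    scores = [dot(f, query) / math.sqrt(d) for f in region_feats]
    weights = softmax(scores)
    # Weighted sum of region features, one output dimension at a time.
    pooled = [sum(w * f[i] for w, f in zip(weights, region_feats))
              for i in range(d)]
    return pooled, weights


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


def classify(pooled, class_embeds):
    """Return the index of the class embedding most similar to the pooled feature."""
    sims = [cosine(pooled, c) for c in class_embeds]
    return max(range(len(sims)), key=sims.__getitem__)


# Toy data: three regions, two classes, 2-dimensional features (all made up).
regions = [[1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
query = [4.0, 0.0]                      # favors regions aligned with axis 0
class_embeds = [[1.0, 0.0], [0.0, 1.0]]  # stand-ins for class text embeddings

pooled, weights = attention_pool(regions, query)
pred = classify(pooled, class_embeds)
```

In this toy setup the query concentrates most of the attention on the first region, so the pooled feature lies closest to the first class embedding even though two of the three regions point toward the second class. A uniform average over regions would instead favor the second class, which is the kind of region-importance effect the abstract refers to.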