Rethinking Prior Information Generation with CLIP for Few-Shot Segmentation

Jin Wang, Bingfeng Zhang, Jian Pang, Honglong Chen, Weifeng Liu
2024-05-14
Abstract: Few-shot segmentation remains challenging because the labeled information available for unseen classes is limited. Most previous approaches extract high-level feature maps from a frozen visual encoder and compute pixel-wise similarity as the key prior guidance for the decoder. However, such a prior representation suffers from coarse granularity and poor generalization to new classes, since these high-level feature maps carry an obvious category bias. In this work, we propose replacing the visual prior representation with the visual-text alignment capacity to capture more reliable guidance and enhance model generalization. Specifically, we design two training-free prior information generation strategies that use the semantic alignment capability of the Contrastive Language-Image Pre-training model (CLIP) to locate the target class. In addition, to acquire more accurate prior guidance, we build a high-order relationship of attention maps and use it to refine the initial prior information. Experiments on both the PASCAL-5^i and COCO-20^i datasets show that our method obtains a substantial improvement and reaches new state-of-the-art performance.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper addresses few-shot segmentation, where the limited labeled information available for unseen classes makes accurate prediction difficult. Existing methods mainly generate prior guidance from high-level features extracted by a frozen visual encoder, but this representation is coarse-grained and generalizes poorly to new classes. The paper proposes using the Contrastive Language-Image Pre-training model (CLIP) to generate more reliable prior information and improve generalization. Specifically, it designs two training-free prior information generation strategies that use CLIP's visual-text alignment ability to locate the target class, and it builds a high-order relationship of attention maps to refine the initial prior information. Experiments on the PASCAL-5^i and COCO-20^i datasets show a substantial improvement and new state-of-the-art performance.
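To make the three ideas in the summary concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes features have already been extracted and flattened to one row per patch; the function names (`visual_prior`, `text_prior`, `refine_prior`), the min-max normalization, and the use of a matrix power as the "high-order relationship" are illustrative assumptions rather than details taken from the paper.

```python
import numpy as np

EPS = 1e-8

def l2_normalize(x, axis=-1):
    """Scale vectors to unit length so dot products become cosine similarities."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + EPS)

def min_max(x):
    """Rescale a prior map to the range [0, 1]."""
    return (x - x.min()) / (x.max() - x.min() + EPS)

def visual_prior(query_feats, support_feats, support_mask):
    """Conventional prior: pixel-wise similarity to support foreground.

    query_feats:   (Nq, C) high-level features of the query image
    support_feats: (Ns, C) high-level features of the support image
    support_mask:  (Ns,)   binary foreground mask of the support image
    """
    foreground = support_feats[support_mask > 0]            # (Nf, C)
    sim = l2_normalize(query_feats) @ l2_normalize(foreground).T
    return min_max(sim.max(axis=1))                         # (Nq,)

def text_prior(patch_feats, text_embed):
    """Text-based prior (assumed form): similarity of each patch to a class text embedding.

    patch_feats: (N, C) image patch embeddings in the joint space
    text_embed:  (C,)   text embedding of the target class name
    """
    sim = l2_normalize(patch_feats) @ l2_normalize(text_embed)
    return min_max(sim)                                     # (N,)

def refine_prior(prior, attention, order=2):
    """Assumed refinement: propagate the prior through a high-order attention matrix.

    attention: (N, N) non-negative patch-to-patch attention weights
    order:     integer power applied to the row-normalized attention
    """
    a = attention / (attention.sum(axis=1, keepdims=True) + EPS)
    high_order = np.linalg.matrix_power(a, order)
    return min_max(high_order @ prior)
```

In this sketch the visual prior needs a support mask and inherits whatever bias the visual features carry, while the text prior only needs a class-name embedding, which is the contrast the summary draws. The refinement step spreads a high prior value to patches that attend strongly to one another.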