Few-Shot Image Classification and Segmentation as Visual Question Answering Using Vision-Language Models

Tian Meng, Yang Tao, Ruilin Lyu, Wuliang Yin
2024-03-15
Abstract: The task of few-shot image classification and segmentation (FS-CS) involves classifying and segmenting target objects in a query image, given only a few examples of the target classes. We introduce the Vision-Instructed Segmentation and Evaluation (VISE) method that transforms the FS-CS problem into the Visual Question Answering (VQA) problem, utilising Vision-Language Models (VLMs), and addresses it in a training-free manner. By enabling a VLM to interact with off-the-shelf vision models as tools, the proposed method is capable of classifying and segmenting target objects using only image-level labels. Specifically, chain-of-thought prompting and in-context learning guide the VLM to answer multiple-choice questions like a human; vision models such as YOLO and Segment Anything Model (SAM) assist the VLM in completing the task. The modular framework of the proposed method makes it easily extendable. Our approach achieves state-of-the-art performance on the Pascal-5i and COCO-20i datasets.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The paper attempts to address the problem of **Few-Shot Image Classification and Segmentation (FS-CS)**: classifying and segmenting target objects in query images given only a small number of examples of the target classes.

### Problem Background

Few-shot image classification and segmentation is a challenging computer vision task, especially when data is very limited. Traditional solutions often rely on intensive training processes, such as meta-learning or transfer learning, which are prone to overfitting, carry a high computational cost, and require fine-tuning on specific datasets.

### Main Contributions of the Paper

1. **Proposed a training-free strategy**: Introduced the Vision-Instructed Segmentation and Evaluation (VISE) modular framework, which leverages off-the-shelf vision tools and vision-language models (VLMs) to address the FS-CS problem.
2. **Redefined FS-CS as a Visual Question Answering (VQA) task**: By using visual and textual prompts, the VLM can interact with vision tools (such as YOLO and Segment Anything Model (SAM)) to achieve precise classification and segmentation.
3. **Achieved state-of-the-art performance on Pascal-5i and COCO-20i datasets**: These results demonstrate the effectiveness of combining VLM reasoning capabilities with off-the-shelf vision tools.

### Method Overview

1. **Framework Overview**: First, sample an N-way K-shot FS-CS task from the database, then use object detection tools (such as YOLO) to obtain bounding boxes in the query images. Next, transform the original FS-CS task into a multiple-choice VQA task based on the support set. Finally, use image segmentation tools (such as SAM) to generate the final segmentation masks for the query set.
2. **Detailed Construction of the VQA Task**: Gradually refine the model's understanding and selection process through structured multiple-choice examples. Each bounding box is accompanied by detailed descriptions to assist the VLM in classification.
3. **Use of Vision Tools**: Utilize state-of-the-art vision tools (such as YOLOv8 for object detection and SAM for image segmentation) to transform the VLM from a passive observer to an active participant in the FS-CS process.

### Experimental Results

The paper reports experiments on the widely used Pascal-5i and COCO-20i datasets, comparing the proposed method with existing few-shot classification and segmentation methods. The results show that the method significantly outperforms existing methods in segmentation mIoU and also performs well in classification accuracy.

### Conclusion

By redefining the FS-CS task as a VQA task and combining the reasoning capabilities of VLMs with off-the-shelf vision tools, the proposed method addresses few-shot image classification and segmentation without any training and offers a new direction for research in this field.
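
To make the multiple-choice VQA step from the Method Overview more concrete, the following is a minimal sketch of how one detected bounding box and the N support class names might be turned into a lettered question, and how a letter might be read back from the VLM's reply. It is not the authors' implementation: the `Box` structure, the function names `build_question` and `parse_answer`, the prompt wording, the extra "none of the above" option, and the `Answer: <letter>` reply convention are all assumptions made for illustration. Detection (YOLO), the VLM call, and segmentation (SAM) are deliberately left out.

```python
from __future__ import annotations

import re
from dataclasses import dataclass
from string import ascii_uppercase

# Hypothetical label for boxes that match no support class (an assumption).
NONE_OPTION = "none of the above"


@dataclass
class Box:
    """A detected bounding box in pixel coordinates (hypothetical structure)."""
    x1: int
    y1: int
    x2: int
    y2: int


def build_question(box: Box, class_names: list[str]) -> tuple[str, dict[str, str]]:
    """Build a multiple-choice prompt for one box and the N support classes."""
    # Map letters A, B, C, ... to the support class names, in order.
    options = {ascii_uppercase[i]: name for i, name in enumerate(class_names)}
    # Assumed extra option so the model can reject an irrelevant box.
    options[ascii_uppercase[len(class_names)]] = NONE_OPTION
    lines = [
        f"Which class does the object in box ({box.x1}, {box.y1}, "
        f"{box.x2}, {box.y2}) belong to?",
        "Think step by step, then finish with 'Answer: <letter>'.",
    ]
    lines += [f"{letter}. {name}" for letter, name in options.items()]
    return "\n".join(lines), options


def parse_answer(reply: str, options: dict[str, str]) -> str | None:
    """Return the chosen class name, or None if the box was rejected or unparsable."""
    match = re.search(r"answer\s*[:\-]?\s*\(?([A-Za-z])\b", reply, re.IGNORECASE)
    if match is None:
        return None
    choice = options.get(match.group(1).upper())
    return None if choice in (None, NONE_OPTION) else choice


if __name__ == "__main__":
    question, opts = build_question(Box(10, 20, 110, 220), ["dog", "cat"])
    print(question)
    print(parse_answer("The pointed ears suggest a cat. Answer: B", opts))
```

In a full pipeline under these assumptions, boxes for which `parse_answer` returns a class name would be passed to the segmentation tool as box prompts, and boxes that return `None` would be discarded.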