Abstract: The task of few-shot image classification and segmentation (FS-CS) involves classifying and segmenting target objects in a query image, given only a few examples of the target classes. We introduce the Vision-Instructed Segmentation and Evaluation (VISE) method that transforms the FS-CS problem into the Visual Question Answering (VQA) problem, utilising Vision-Language Models (VLMs), and addresses it in a training-free manner. By enabling a VLM to interact with off-the-shelf vision models as tools, the proposed method is capable of classifying and segmenting target objects using only image-level labels. Specifically, chain-of-thought prompting and in-context learning guide the VLM to answer multiple-choice questions like a human; vision models such as YOLO and Segment Anything Model (SAM) assist the VLM in completing the task. The modular framework of the proposed method makes it easily extendable. Our approach achieves state-of-the-art performance on the Pascal-5i and COCO-20i datasets.

What problem does this paper attempt to address?

The paper attempts to address the problem of **Few-Shot Image Classification and Segmentation (FS-CS)**. Specifically, the goal of the paper is to classify and segment target objects in query images given a small number of target class examples.

### Problem Background

In the field of computer vision, few-shot image classification and segmentation is a challenging task, especially when data is very limited. Traditional solutions often rely on intensive training processes, such as meta-learning or transfer learning, but these methods have limitations such as overfitting, high computational cost, and the need for specific dataset fine-tuning.

### Main Contributions of the Paper

1. **Proposed a training-free strategy**: Introduced the Vision-Instructed Segmentation and Evaluation (VISE) modular framework, which leverages off-the-shelf vision tools and vision-language models (VLMs) to address the FS-CS problem.
2. **Redefined FS-CS as a Visual Question Answering (VQA) task**: By using visual and textual prompts, the VLM can interact with vision tools (such as YOLO and Segment Anything Model (SAM)) to achieve precise classification and segmentation.
3. **Achieved state-of-the-art performance on Pascal-5i and COCO-20i datasets**: These results demonstrate the effectiveness of combining VLM reasoning capabilities with off-the-shelf vision tools.

### Method Overview

1. **Framework Overview**: First, sample an N-way K-shot FS-CS task from the database, then use object detection tools (such as YOLO) to obtain bounding boxes in the query images. Next, transform the original FS-CS task into a multiple-choice VQA task based on the support set. Finally, use image segmentation tools (such as SAM) to generate the final segmentation masks for the query set.
2. **Detailed Construction of the VQA Task**: Gradually refine the model's understanding and selection process through structured multiple-choice examples. Each bounding box is accompanied by detailed descriptions to assist the VLM in classification.
3. **Use of Vision Tools**: Utilize state-of-the-art vision tools (such as YOLOv8 for object detection and SAM for image segmentation) to transform the VLM from a passive observer to an active participant in the FS-CS process.

### Experimental Results

The paper conducted experiments on the widely used Pascal-5i and COCO-20i datasets, comparing the proposed method with existing few-shot classification and segmentation methods. The results show that this method significantly outperforms existing methods in terms of segmentation mIoU metrics and also performs well in classification accuracy.

### Conclusion

By redefining the FS-CS task as a VQA task and combining the reasoning capabilities of VLMs with off-the-shelf vision tools, the proposed method in this paper performs excellently in addressing the few-shot image classification and segmentation problem, providing new ideas and methods for research in this field.
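The detect, ask, segment flow described in the Method Overview can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the names `Detection`, `build_question`, `parse_choice`, and `run_fs_cs` are hypothetical, the prompt wording and the "none of the above" option are assumptions, and the detector, VLM, and segmenter are passed in as plain callables so that no specific YOLO, VLM, or SAM API is implied. Support-set in-context examples and chain-of-thought prompting are omitted for brevity.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List, Optional, Sequence, Tuple

Box = Tuple[int, int, int, int]  # (x1, y1, x2, y2) in pixels
LETTERS = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"


@dataclass
class Detection:
    """One detector output (hypothetical structure)."""
    box: Box
    description: str  # text description of the region, shown to the VLM


def build_question(det: Detection, class_names: Sequence[str]) -> str:
    """Format one bounding box as a multiple-choice question (assumed wording)."""
    options = list(class_names) + ["none of the above"]
    lines = [
        f"Region {det.box}: {det.description}",
        "Which class does this region show?",
    ]
    lines += [f"({LETTERS[i]}) {name}" for i, name in enumerate(options)]
    return "\n".join(lines)


def parse_choice(answer: str, n_classes: int) -> Optional[int]:
    """Map a reply such as 'A' or '(b)' to a class index.

    Returns None for 'none of the above' or an unparseable reply.
    """
    answer = answer.strip().upper().lstrip("(")
    if not answer:
        return None
    idx = LETTERS.find(answer[0])
    return idx if 0 <= idx < n_classes else None


def run_fs_cs(
    query_image: Any,
    class_names: Sequence[str],
    detect: Callable[[Any], List[Detection]],  # stands in for a detector such as YOLO
    ask_vlm: Callable[[str], str],             # stands in for the VLM
    segment: Callable[[Any, Box], Any],        # stands in for a box-prompted segmenter such as SAM
) -> Tuple[Dict[str, bool], Dict[str, List[Any]]]:
    """Detect boxes, classify each box via a question, segment the accepted boxes."""
    present = {name: False for name in class_names}
    masks: Dict[str, List[Any]] = {name: [] for name in class_names}
    for det in detect(query_image):
        reply = ask_vlm(build_question(det, class_names))
        idx = parse_choice(reply, len(class_names))
        if idx is None:
            continue  # box rejected: not one of the target classes
        name = class_names[idx]
        present[name] = True
        masks[name].append(segment(query_image, det.box))
    return present, masks


if __name__ == "__main__":
    # Toy stand-ins for the three tools, used only to exercise the control flow.
    dets = [Detection((0, 0, 10, 10), "a furry animal"),
            Detection((5, 5, 20, 20), "a metal object")]
    present, masks = run_fs_cs(
        query_image=None,
        class_names=["cat", "dog"],
        detect=lambda img: dets,
        ask_vlm=lambda q: "A" if "furry" in q else "C",
        segment=lambda img, box: f"mask{box}",
    )
    print(present, masks)
```

Passing the tools in as callables mirrors the modularity the summary emphasizes: a different detector or segmenter could be swapped in without touching the question-building or answer-parsing logic.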

Few-Shot Image Classification and Segmentation as Visual Question Answering Using Vision-Language Models
