The Solution for CVPR2024 Foundational Few-Shot Object Detection Challenge

Hongpeng Pan, Shifeng Yi, Shouwei Yang, Lei Qi, Bing Hu, Yi Xu, Yang Yang

2024-06-18

Abstract: This report introduces an enhanced method for the Foundational Few-Shot Object Detection (FSOD) task that leverages a vision-language model (VLM) for object detection. On specific datasets, however, a VLM may detect targets that are misaligned with the target concepts of interest. This misalignment hinders both the zero-shot performance of the VLM and the application of fine-tuning methods based on pseudo-labels. To address this issue, we propose the VLM+ framework, which integrates a multimodal large language model (MM-LLM). Specifically, we use the MM-LLM to generate a series of referential expressions for each category. Based on the VLM predictions and the given annotations, we select the best referential expression for each category by matching the maximum IoU. We then use these referential expressions to generate pseudo-labels for all images in the training set and combine them with the original labeled data to fine-tune the VLM. Additionally, we employ iterative pseudo-label generation and optimization to further enhance the performance of the VLM. Our approach achieves 32.56 mAP in the final test.
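The selection step in the abstract (choosing, per category, the referential expression whose VLM predictions best match the given annotations by IoU) can be illustrated with a minimal sketch. The abstract does not say how IoU is aggregated over boxes, so the scoring rule here (mean, over annotated boxes, of the best IoU with any predicted box) is an assumption, and the names `iou`, `score_expression`, and `select_best_expression` are hypothetical helpers rather than the authors' code.

```python
from typing import Dict, List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def score_expression(preds: List[Box], gts: List[Box]) -> float:
    """Assumed scoring rule: mean over annotated boxes of the best IoU
    achieved by any box predicted with this expression."""
    if not gts:
        return 0.0
    best_per_gt = [max((iou(g, p) for p in preds), default=0.0) for g in gts]
    return sum(best_per_gt) / len(gts)


def select_best_expression(
    preds_by_expr: Dict[str, List[Box]], gts: List[Box]
) -> str:
    """Return the candidate expression with the highest IoU-based score."""
    return max(preds_by_expr, key=lambda e: score_expression(preds_by_expr[e], gts))


# Toy example for one category: two candidate expressions, one annotated box.
gts = [(0.0, 0.0, 10.0, 10.0)]
preds_by_expr = {
    "a car": [(5.0, 5.0, 15.0, 15.0)],
    "a red sedan": [(0.0, 0.0, 10.0, 9.0)],
}
best = select_best_expression(preds_by_expr, gts)
```

In a real pipeline the predicted boxes would come from prompting the VLM with each candidate expression on the few annotated images; here they are hard-coded for illustration.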

Computer Vision and Pattern Recognition

What problem does this paper attempt to address?

The paper addresses the misalignment between the objects a Vision-Language Model (VLM) detects on a specific dataset and the category concepts of interest in the Foundational Few-Shot Object Detection (FSOD) task. This misalignment degrades the VLM's zero-shot performance and limits fine-tuning methods that rely on pseudo-labels. To solve the problem, the authors propose the VLM+ framework, which integrates a Multimodal Large Language Model (MM-LLM) to generate referential expressions for each category. The best expression for each category is selected through maximum IoU matching between the VLM predictions and the given annotations, and is then used to generate pseudo-labels for fine-tuning the VLM.

The main contributions of the paper are:

1. **Concept Alignment**: Using the MM-LLM to generate descriptive referential expressions for each category, improving the VLM's understanding of dataset-specific object concepts.
2. **Iterative Pseudo-Label Optimization**: Iteratively generating and refining pseudo-labels to further improve the VLM's detection performance.
3. **Experimental Validation**: Evaluating the VLM+ framework on the FSOD challenge dataset, where it reaches 32.56 mAP on the final test.

Together, these steps target the concept misalignment of VLMs on specific datasets and improve performance on few-shot object detection.
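The iterative pseudo-label loop summarized above can be sketched as follows. This is only an illustration under stated assumptions: the source does not give the filtering rules, so the confidence threshold `score_thr`, the overlap threshold `iou_thr`, and the choice to drop pseudo-labels that overlap a given annotation are all assumed, and `predict` and `fine_tune` are hypothetical callables standing in for VLM inference and training.

```python
from typing import Callable, Dict, List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)
Det = Tuple[Box, str, float]             # (box, category, score)
Labels = Dict[str, List[Det]]            # image id -> labeled boxes


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def merge_labels(
    gt: Labels, pseudo: Labels, score_thr: float = 0.5, iou_thr: float = 0.5
) -> Labels:
    """Combine given annotations with pseudo-labels (assumed rules):
    keep every annotation, and add a pseudo-label only if it is confident
    enough and does not overlap an existing annotation."""
    merged: Labels = {}
    for img in sorted(set(gt) | set(pseudo)):
        kept = list(gt.get(img, []))
        for box, cat, score in pseudo.get(img, []):
            if score < score_thr:
                continue  # low-confidence prediction
            if any(iou(box, g[0]) >= iou_thr for g in gt.get(img, [])):
                continue  # already covered by a given annotation
            kept.append((box, cat, score))
        merged[img] = kept
    return merged


def iterative_self_training(
    model,
    images: List[str],
    gt: Labels,
    predict: Callable,    # hypothetical: (model, image id) -> List[Det]
    fine_tune: Callable,  # hypothetical: (model, Labels) -> updated model
    rounds: int = 2,
):
    """Alternate pseudo-label generation and fine-tuning for `rounds` rounds."""
    for _ in range(rounds):
        pseudo = {img: predict(model, img) for img in images}
        model = fine_tune(model, merge_labels(gt, pseudo))
    return model
```

Each round regenerates pseudo-labels with the most recently fine-tuned model, so label quality can improve as the model does; the number of rounds is a free parameter in this sketch.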
