The Solution for CVPR2024 Foundational Few-Shot Object Detection Challenge

Hongpeng Pan, Shifeng Yi, Shouwei Yang, Lei Qi, Bing Hu, Yi Xu, Yang Yang
2024-06-18
Abstract: This report introduces an enhanced method for the Foundational Few-Shot Object Detection (FSOD) task that uses a vision-language model (VLM) for object detection. On specific datasets, however, the targets a VLM detects may be misaligned with the target concepts of interest. This misalignment hinders both the zero-shot performance of the VLM and the application of fine-tuning methods based on pseudo-labels. To address this issue, we propose the VLM+ framework, which integrates a multimodal large language model (MM-LLM). Specifically, we use the MM-LLM to generate a series of referential expressions for each category. Based on the VLM predictions and the given annotations, we select the best referential expression for each category by matching the maximum IoU. We then use these referential expressions to generate pseudo-labels for all images in the training set and combine them with the original labeled data to fine-tune the VLM. Additionally, we employ iterative pseudo-label generation and optimization to further enhance the performance of the VLM. Our approach achieves 32.56 mAP in the final test.
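The expression-selection step described in the abstract can be illustrated with a minimal sketch. The abstract says only that the best referential expression is chosen "by matching the maximum IoU"; the scoring rule used here (average, over annotated boxes, of the best IoU with any predicted box) is an assumption, and the names `iou`, `score_expression`, and `select_best_expression` are hypothetical rather than taken from the authors' code.

```python
from typing import Dict, List, Tuple

# A box is (x1, y1, x2, y2) in pixel coordinates.
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def score_expression(preds: List[Box], gts: List[Box]) -> float:
    """Assumed scoring rule: mean over annotated boxes of the best IoU
    achieved by any box the VLM predicted for this expression."""
    if not gts:
        return 0.0
    best_per_gt = [max((iou(p, g) for p in preds), default=0.0) for g in gts]
    return sum(best_per_gt) / len(gts)


def select_best_expression(preds_by_expr: Dict[str, List[Box]],
                           gts: List[Box]) -> str:
    """Return the candidate expression whose predictions best match
    the few-shot annotations of one category."""
    return max(preds_by_expr,
               key=lambda expr: score_expression(preds_by_expr[expr], gts))


if __name__ == "__main__":
    # Hypothetical data: one annotated box and two candidate expressions.
    annotations = [(0.0, 0.0, 10.0, 10.0)]
    candidates = {
        "a car": [(5.0, 5.0, 15.0, 15.0)],
        "a four-wheeled passenger vehicle": [(0.0, 0.0, 10.0, 9.0)],
    }
    print(select_best_expression(candidates, annotations))
```

In this toy case the second expression wins because its predicted box overlaps the annotation far more closely. A real pipeline would run the VLM once per candidate expression over the annotated few-shot images and aggregate the scores per category.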
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The paper addresses a misalignment problem in the Foundational Few-Shot Object Detection (FSOD) task: on specific datasets, the objects detected by a Vision-Language Model (VLM) may not correspond to the category concepts of interest. This misalignment degrades the zero-shot performance of the VLM and limits the usefulness of fine-tuning methods based on pseudo-labels. To solve this problem, the authors propose the VLM+ framework, which integrates a Multimodal Large Language Model (MM-LLM) to generate referential expressions for each category. The best expression for each category is selected through maximum IoU matching and then used to generate pseudo-labels for fine-tuning the VLM. The main contributions of the paper are:
1. **Concept Alignment**: Using the MM-LLM to generate descriptive referential expressions for each category, so that the VLM's detections better match the intended object concepts.
2. **Iterative Pseudo-Label Optimization**: Iteratively generating and refining pseudo-labels to further improve the VLM's detection performance.
3. **Experimental Validation**: Evaluating the VLM+ framework on the FSOD challenge dataset, where it achieves 32.56 mAP on the final test.
Together, these methods target the VLM's concept alignment issue in specific application scenarios and improve its performance on few-shot object detection.
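The iterative pseudo-label procedure can be sketched as a short loop. This is a hedged illustration only: the summary does not specify the number of rounds, whether pseudo-labels are filtered by confidence, or how they are merged with the given annotations, so `rounds`, `min_score`, `merge_labels`, and the `predict` / `fine_tune` callables are all hypothetical stand-ins.

```python
from typing import Any, Callable, Dict, List, Tuple

Box = Tuple[float, float, float, float]
# One detection: (box, category name, confidence score).
Detection = Tuple[Box, str, float]
# Image id -> detections for that image.
Annotations = Dict[str, List[Detection]]


def merge_labels(ground_truth: Annotations, pseudo: Annotations,
                 min_score: float) -> Annotations:
    """Combine given annotations with pseudo-labels.
    Assumption: pseudo-labels below `min_score` are dropped. Duplicate
    removal against the ground truth is omitted for brevity."""
    merged: Annotations = {}
    for image_id in set(ground_truth) | set(pseudo):
        kept = [d for d in pseudo.get(image_id, []) if d[2] >= min_score]
        merged[image_id] = list(ground_truth.get(image_id, [])) + kept
    return merged


def iterative_pseudo_labeling(
    model: Any,
    predict: Callable[[Any, str], List[Detection]],
    fine_tune: Callable[[Any, Annotations], Any],
    image_ids: List[str],
    ground_truth: Annotations,
    rounds: int = 2,        # assumed value, not from the paper
    min_score: float = 0.5  # assumed value, not from the paper
) -> Any:
    """Alternate between labeling the training set with the current
    model and fine-tuning on ground truth plus pseudo-labels."""
    for _ in range(rounds):
        pseudo = {image_id: predict(model, image_id) for image_id in image_ids}
        train_set = merge_labels(ground_truth, pseudo, min_score)
        model = fine_tune(model, train_set)
    return model
```

Here `predict` would wrap a VLM prompted with the selected referential expressions, and `fine_tune` would wrap whatever training routine the detector provides. Each round starts from the model produced by the previous one, so later pseudo-labels reflect the earlier fine-tuning.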