Zero-shot Generalizable Incremental Learning for Vision-Language Object Detection

Jieren Deng, Haojian Zhang, Kun Ding, Jianhua Hu, Xingxuan Zhang, Yunkuan Wang
2024-10-16
Abstract: This paper presents Incremental Vision-Language Object Detection (IVLOD), a novel learning task designed to incrementally adapt pre-trained Vision-Language Object Detection Models (VLODMs) to various specialized domains, while simultaneously preserving their zero-shot generalization capabilities for the generalized domain. To address this new challenge, we present the Zero-interference Reparameterizable Adaptation (ZiRa), a novel method that introduces Zero-interference Loss and reparameterization techniques to tackle IVLOD without incurring additional inference costs or a significant increase in memory usage. Comprehensive experiments on COCO and ODinW-13 datasets demonstrate that ZiRa effectively safeguards the zero-shot generalization ability of VLODMs while continuously adapting to new tasks. Specifically, after training on ODinW-13 datasets, ZiRa exhibits superior performance compared to CL-DETR and iDETR, boosting zero-shot generalizability by a substantial 13.91 and 8.74 AP, respectively. Our code is available at https://github.com/JarintotionDin/ZiRaGroundingDINO.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
### Problems the paper attempts to solve

This paper addresses the problem of incrementally adapting vision-language object detection models (VLODMs) to multiple specialized domains while preserving their zero-shot generalization ability. Specifically, the paper proposes a new task, **Incremental Vision-Language Object Detection (IVLOD)**, whose goal is to keep the model effective on both known and unknown objects while it continuously adapts to new downstream tasks.

### Background and challenges

1. **Limitations of existing models**:
   - Although existing VLODMs have strong zero-shot recognition capabilities, they perform poorly in specialized domains (such as recognizing aquatic organisms in aquariums or interpreting remote-sensing images captured by drones).
   - In practical applications, VLODMs must adapt to various unforeseen downstream tasks to reach the required accuracy.
2. **Challenges in incremental learning**:
   - **Catastrophic forgetting**: When a new task is introduced, the model's performance on previously learned tasks may drop sharply.
   - **Maintaining zero-shot generalization ability**: While learning new tasks, the model must retain its ability to recognize objects of unseen classes.

### Solutions

To address these challenges, the paper proposes a new method named **Zero-interference Reparameterizable Adaptation (ZiRa)**. The core innovations of ZiRa include:

1. **Reparameterizable Dual-Branch Structure (RDB)**:
   - **Low Learning Rate Branch (LLRB)**: Uses a lower learning rate, which helps protect previously learned knowledge.
   - **High Learning Rate Branch (HLRB)**: Uses a higher learning rate to adapt quickly to new tasks.
   - **Dynamic balance**: The two learning rates together balance stability against plasticity.
2. **Zero-interference Loss (ZiL)**:
   - **Protect zero-shot generalization ability**: Penalizing the output norm of the RDB keeps the model from interfering excessively with existing knowledge while it adapts to new tasks.
   - **Prevent downstream task forgetting**: ZiL is additionally applied to the HLRB to reduce forgetting of earlier tasks while a new task is being learned.

### Experimental results

The paper reports comprehensive experiments on the COCO and ODinW-13 datasets to verify the effectiveness of ZiRa:

- **Zero-shot generalization ability**: After training on the ODinW-13 datasets, the zero-shot AP of ZiRa on COCO is 13.91 and 8.74 higher than that of CL-DETR and iDETR, respectively.
- **Downstream task adaptability**: ZiRa outperforms existing methods on the ODinW-13 datasets, demonstrating its incremental learning ability across multiple specialized domains.

### Summary

By introducing the IVLOD task and the ZiRa method, this paper addresses the incremental adaptation of VLODMs to specialized domains while maintaining their zero-shot generalization ability. This offers a new way to make vision-language object detection models broadly adaptable in practical applications.
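To make the RDB and ZiL ideas from the Solutions section more concrete, the following is a minimal toy sketch in plain Python, not the authors' implementation. It rests on several assumptions that the summary does not state: the branches are additive linear maps on top of frozen weights, the "output norm" is an L1 norm, the two ZiL terms (on the whole RDB output and on the HLRB output) share one weight, and the task loss is a squared error. The names `DualBranchLinear`, `zil_weight`, `lr_low`, and `lr_high` are hypothetical.

```python
from typing import List

Vector = List[float]


def dot(u: Vector, v: Vector) -> float:
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def sign(z: float) -> float:
    """Subgradient of |z|: -1, 0, or +1."""
    return float((z > 0) - (z < 0))


class DualBranchLinear:
    """Toy stand-in for an RDB: frozen weights plus two additive linear branches."""

    def __init__(self, frozen: Vector, lr_low: float, lr_high: float,
                 zil_weight: float) -> None:
        self.frozen = list(frozen)       # pre-trained weights, never updated
        self.low = [0.0] * len(frozen)   # LLRB: small steps, protects old knowledge
        self.high = [0.0] * len(frozen)  # HLRB: large steps, adapts quickly
        self.lr_low = lr_low
        self.lr_high = lr_high
        self.zil_weight = zil_weight     # assumed shared weight for both ZiL terms

    def forward(self, x: Vector) -> float:
        return dot(self.frozen, x) + dot(self.low, x) + dot(self.high, x)

    def loss(self, x: Vector, target: float) -> float:
        """Squared error plus assumed L1 penalties on the branch outputs."""
        r_high = dot(self.high, x)
        r_all = dot(self.low, x) + r_high
        err = self.forward(x) - target
        return 0.5 * err * err + self.zil_weight * (abs(r_all) + abs(r_high))

    def train_step(self, x: Vector, target: float) -> float:
        """One subgradient step; returns the loss measured before the update."""
        before = self.loss(x, target)
        r_high = dot(self.high, x)
        r_all = dot(self.low, x) + r_high
        err = self.forward(x) - target
        # Both branches see the task error and the penalty on the whole RDB output;
        # only the HLRB also sees the penalty on its own output.
        g_low = err + self.zil_weight * sign(r_all)
        g_high = g_low + self.zil_weight * sign(r_high)
        for i, xi in enumerate(x):
            self.low[i] -= self.lr_low * g_low * xi
            self.high[i] -= self.lr_high * g_high * xi
        return before

    def merged_weights(self) -> Vector:
        """Fold both branches into one weight vector (exact for linear branches)."""
        return [f + l + h for f, l, h in zip(self.frozen, self.low, self.high)]


layer = DualBranchLinear(frozen=[1.0, -2.0], lr_low=0.01, lr_high=0.1, zil_weight=0.1)
x, target = [1.0, 2.0], 0.0
losses = [layer.train_step(x, target) for _ in range(20)]
print(f"loss before training: {losses[0]:.3f}, after 20 steps: {losses[-1]:.3f}")
print("merged output matches:",
      abs(dot(layer.merged_weights(), x) - layer.forward(x)) < 1e-9)
```

Because the branches in this sketch are linear and additive, folding them into a single weight vector reproduces the dual-branch output exactly. That is one plausible reading of how reparameterization can avoid extra inference cost; the actual merging schedule used by ZiRa is not described in this summary.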