Abstract: This paper presents Incremental Vision-Language Object Detection (IVLOD), a novel learning task designed to incrementally adapt pre-trained Vision-Language Object Detection Models (VLODMs) to various specialized domains while preserving their zero-shot generalization capabilities for the generalized domain. To address this new challenge, we present Zero-interference Reparameterizable Adaptation (ZiRa), a novel method that introduces Zero-interference Loss and reparameterization techniques to tackle IVLOD without incurring additional inference costs or a significant increase in memory usage. Comprehensive experiments on the COCO and ODinW-13 datasets demonstrate that ZiRa effectively safeguards the zero-shot generalization ability of VLODMs while continuously adapting to new tasks. Specifically, after training on the ODinW-13 datasets, ZiRa exhibits superior performance compared to CL-DETR and iDETR, boosting zero-shot generalizability by a substantial 13.91 and 8.74 AP, respectively. Our code is available at https://github.com/JarintotionDin/ZiRaGroundingDINO.

What problem does this paper attempt to address?

### Problems the paper attempts to solve

This paper addresses the incremental adaptation of vision-language object detection models (VLODMs) to multiple specialized domains while maintaining their zero-shot generalization ability. Specifically, it proposes a new task, **Incremental Vision-Language Object Detection (IVLOD)**, whose goal is to keep the model effective on both known and unknown objects while it continuously adapts to new downstream tasks.

### Background and challenges

1. **Limitations of existing models**:
   - Although existing VLODMs have strong zero-shot recognition capabilities, they perform poorly in specialized domains (such as aquatic organism recognition in aquariums or remote-sensing image interpretation by drones).
   - In practical applications, VLODMs must adapt to various unforeseen downstream tasks to reach the required accuracy.
2. **Challenges in incremental learning**:
   - **Catastrophic forgetting**: When a new task is introduced, the model's performance on previously learned tasks may drop sharply.
   - **Maintaining zero-shot generalization ability**: While learning new tasks, the model must retain its ability to recognize objects of unseen classes.

### Solutions

To address these challenges, the paper proposes a new method named **Zero-interference Reparameterizable Adaptation (ZiRa)**. The core innovations of ZiRa are:

1. **Reparameterizable Dual-Branch Structure (RDB)**:
   - **Low Learning Rate Branch (LLRB)**: Trained with a lower learning rate, which helps protect previously learned knowledge.
   - **High Learning Rate Branch (HLRB)**: Trained with a higher learning rate to adapt quickly to new tasks.
   - **Dynamic balance**: The two learning rates together balance stability against plasticity.
2. **Zero-interference Loss (ZiL)**:
   - **Protect zero-shot generalization ability**: ZiL penalizes the norm of the RDB output, so that adapting to new tasks does not overly interfere with existing knowledge.
   - **Prevent downstream task forgetting**: ZiL is additionally applied to the HLRB to reduce forgetting of existing tasks while a new task is being learned.

### Experimental results

The paper reports comprehensive experiments on the COCO and ODinW-13 datasets to verify the effectiveness of ZiRa:

- **Zero-shot generalization ability**: After training on the ODinW-13 datasets, the zero-shot AP of ZiRa on COCO is 13.91 and 8.74 higher than that of CL-DETR and iDETR, respectively.
- **Downstream task adaptability**: ZiRa outperforms existing methods on the ODinW-13 datasets, demonstrating its incremental learning ability across multiple specialized domains.

### Summary

By introducing the IVLOD task and the ZiRa method, the paper tackles the incremental adaptation of VLODMs to specialized domains while maintaining their zero-shot generalization ability. This offers a new approach to making vision-language object detection models broadly adaptable in practical applications.
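The two ZiRa components described under Solutions, a dual-branch side path trained with two learning rates and a penalty on that path's output norm, can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. The summary does not specify the branch architecture, the norm, the loss weight, or the task loss, so the linear branches, the L1 norm, the weight `lam`, the squared-error task loss, the learning-rate values, and all names (`W_base`, `A_low`, `A_high`, `train_step`, `reparameterize`) are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out = 4, 3

W_base = rng.normal(size=(d_out, d_in))  # stand-in for a frozen pre-trained weight
A_low = np.zeros((d_out, d_in))          # LLRB: low learning rate branch (assumed linear)
A_high = np.zeros((d_out, d_in))         # HLRB: high learning rate branch (assumed linear)


def rdb_output(x, A_low, A_high):
    """Output of the dual-branch side path only (no base layer)."""
    return (A_low + A_high) @ x


def forward(x, W_base, A_low, A_high):
    """Base layer output plus the residual added by the two branches."""
    return W_base @ x + rdb_output(x, A_low, A_high)


def zero_interference_loss(x, A_low, A_high):
    """Assumed form: L1 norm of the RDB output plus L1 norm of the HLRB output."""
    return np.abs(rdb_output(x, A_low, A_high)).sum() + np.abs(A_high @ x).sum()


def train_step(x, target, W_base, A_low, A_high,
               lr_low=1e-3, lr_high=1e-2, lam=0.1):
    """One manual SGD step on 0.5*||error||^2 + lam * ZiL; W_base stays frozen."""
    err = forward(x, W_base, A_low, A_high) - target
    g_task = np.outer(err, x)  # task-loss gradient, identical for both branches
    g_rdb = np.outer(np.sign(rdb_output(x, A_low, A_high)), x)  # penalty on RDB output
    g_high = np.outer(np.sign(A_high @ x), x)                   # extra penalty on HLRB
    new_low = A_low - lr_low * (g_task + lam * g_rdb)
    new_high = A_high - lr_high * (g_task + lam * (g_rdb + g_high))
    return new_low, new_high


def reparameterize(W_base, A_low, A_high):
    """Fold both branches into one weight so inference uses a single matmul."""
    return W_base + A_low + A_high


x = rng.normal(size=d_in)
target = rng.normal(size=d_out)
for _ in range(50):
    A_low, A_high = train_step(x, target, W_base, A_low, A_high)

W_merged = reparameterize(W_base, A_low, A_high)
same_output = np.allclose(W_merged @ x, forward(x, W_base, A_low, A_high))
```

Because the assumed branches are linear and sit in parallel with the base layer, folding them into `W_base` leaves the output unchanged, which is one way a reparameterizable design can avoid extra inference cost. With both branches at zero the penalty is zero, so the adapted model starts out identical to the pre-trained one.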

Zero-shot Generalizable Incremental Learning for Vision-Language Object Detection

Zero-Shot Detection with Transferable Object Proposal Mechanism.

EZ-HOI: VLM Adaptation via Guided Prompt Learning for Zero-Shot HOI Detection

OV-DINO: Unified Open-Vocabulary Detection with Language-Aware Selective Fusion

Open-vocabulary Object Detection via Vision and Language Knowledge Distillation

Towards Zero-shot Human-Object Interaction Detection via Vision-Language Integration

Efficient Feature Distillation for Zero-shot Annotation Object Detection

Mind the Interference: Retaining Pre-trained Knowledge in Parameter Efficient Continual Learning of Vision-Language Models

End-to-End Zero-Shot HOI Detection Via Vision and Language Knowledge Distillation

GroundVLP: Harnessing Zero-shot Visual Grounding from Vision-Language Pre-training and Open-Vocabulary Object Detection

What Makes Good Open-Vocabulary Detector: A Disassembling Perspective

DST-Det: Simple Dynamic Self-Training for Open-Vocabulary Object Detection

Learning Task-Aware Language-Image Representation for Class-Incremental Object Detection

Visual Programming for Zero-shot Open-Vocabulary 3D Visual Grounding

VLM-PL: Advanced Pseudo Labeling Approach for Class Incremental Object Detection via Vision-Language Model

Advancing Cross-domain Discriminability in Continual Learning of Vision-Language Models

VisLingInstruct: Elevating Zero-Shot Learning in Multi-Modal Language Models with Autonomous Instruction Optimization

How to Evaluate the Generalization of Detection? A Benchmark for Comprehensive Open-Vocabulary Detection

Zero-Shot In-Distribution Detection in Multi-Object Settings Using Vision-Language Foundation Models

Incrementally Zero-Shot Detection by an Extreme Value Analyzer

Domain-Aware Continual Zero-Shot Learning