VLDadaptor: Domain Adaptive Object Detection with Vision-Language Model Distillation

Junjie Ke, Lihuo He, Bo Han, Jie Li, Di Wang, Xinbo Gao
DOI: https://doi.org/10.1109/tmm.2024.3453061
IF: 7.3
2024-01-01
IEEE Transactions on Multimedia
Abstract: Domain adaptive object detection (DAOD) aims to develop a detector trained on labeled source domains to identify objects in unlabeled target domains. A primary challenge in DAOD is the domain shift problem. Most existing methods learn domain-invariant features within a single domain embedding space, often resulting in heavy model biases due to the intrinsic data properties of source domains. To mitigate these model biases, this paper proposes VLDadaptor, a domain adaptive object detector based on vision-language model (VLM) distillation. First, the proposed method integrates domain-mixed contrastive knowledge distillation between the visual encoder of CLIP and the detector by transferring category-level instance features, which guarantees that the detector can extract domain-invariant visual instance features across domains. Then, VLDadaptor employs domain-mixed consistency distillation between the text encoder of CLIP and the detector by aligning text prompt embeddings with visual instance features, which helps to maintain category-level feature consistency among the detector, the text encoder, and the visual encoder of the VLM. Finally, the proposed method further promotes adaptation ability by adopting a prompt-based memory bank to generate semantic-complete features for graph matching. These contributions enable VLDadaptor to extract visual features into the vision-language embedding space without any evident model bias towards specific domains. Extensive experimental results demonstrate that the proposed method achieves state-of-the-art performance on Pascal VOC to Clipart adaptation tasks and exhibits high accuracy on driving scenario tasks with significantly less training time. Code is available at https://github.com/GingerCohle/VLDadaptor.