Toward Open Vocabulary Aerial Object Detection with CLIP-Activated Student-Teacher Learning

Yan Li, Weiwei Guo, Xue Yang, Ning Liao, Dunyun He, Jiaqi Zhou, Wenxian Yu
2024-10-29
Abstract: An increasingly massive number of remote-sensing images spurs the development of extensible object detectors that can detect objects beyond training categories without costly collecting new labeled data. In this paper, we aim to develop open-vocabulary object detection (OVD) technique in aerial images that scales up object vocabulary size beyond training data. The performance of OVD greatly relies on the quality of class-agnostic region proposals and pseudo-labels for novel object categories. To simultaneously generate high-quality proposals and pseudo-labels, we propose CastDet, a CLIP-activated student-teacher open-vocabulary object Detection framework. Our end-to-end framework following the student-teacher self-learning mechanism employs the RemoteCLIP model as an extra omniscient teacher with rich knowledge. By doing so, our approach boosts not only novel object proposals but also classification. Furthermore, we devise a dynamic label queue strategy to maintain high-quality pseudo labels during batch training. We conduct extensive experiments on multiple existing aerial object detection datasets, which are set up for the OVD task. Experimental results demonstrate our CastDet achieving superior open-vocabulary detection performance, e.g., reaching 46.5% mAP on VisDroneZSD novel categories, which outperforms the state-of-the-art open-vocabulary detectors by 21.0% mAP. To our best knowledge, this is the first work to apply and develop the open-vocabulary object detection technique for aerial images. The code is available at https://github.com/lizzy8587/CastDet.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
### Problems the Paper Attempts to Solve

This paper addresses open-vocabulary object detection (OVD) in aerial images. The goal is a detector that can find objects from categories absent from the training data, without the cost of collecting and annotating new data. Current aerial object detectors perform well on the categories they were trained on but fail to recognize categories not seen during training, which limits their use in open-world scenarios.

### Background and Challenges

1. **Dataset Scale and Category Vocabulary**: Existing aerial datasets are smaller in scale and category vocabulary than natural-image datasets, which limits how far a detector's vocabulary can scale.
2. **Background Interference**: Objects in aerial images often resemble the background, making it difficult for detectors to separate target objects from background clutter.
3. **High Annotation Cost**: Collecting and annotating large-scale aerial imagery is costly and requires specialized knowledge.

### Solution

To overcome these challenges, the paper proposes **CastDet**, an open-vocabulary object detection method built on a CLIP-activated student-teacher learning framework. Its main contributions are:

1. **Multi-Teacher Self-Learning Mechanism**: CastDet consists of a student model and two teacher models. The student model is the detector being trained, guided by a localization teacher and an external teacher. The localization teacher discovers and localizes potential objects, while the external teacher (the RemoteCLIP model, per the abstract) classifies novel categories and generates pseudo-labels.
2. **Dynamic Label Queue**: A dynamic label queue stores and updates high-quality pseudo-labels, maintaining label quality during batch training.
3. **Hybrid Training Strategy**: Training combines labeled data, unlabeled data, and data from the dynamic label queue, gradually expanding the detector's category vocabulary.

### Experimental Results

The paper reports extensive experiments on multiple existing aerial object detection datasets configured for the OVD task. For example, on the novel categories of the VisDroneZSD dataset, CastDet reaches 46.5% mAP, which is 21.0% mAP higher than state-of-the-art open-vocabulary detectors.

### Conclusion

According to the authors, this is the first work to apply and develop open-vocabulary object detection for aerial images. The proposed CastDet framework addresses the challenge of detecting novel-category objects in aerial images. The method improves detection accuracy and recall while reducing reliance on additional annotated data.
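
As an illustration of the dynamic label queue idea mentioned under Solution, the following is a minimal sketch of one way such a queue could behave. It is not CastDet's implementation: the class names `PseudoLabel` and `DynamicLabelQueue`, the per-category bounded queue, the `min_score` threshold, and the oldest-first eviction are all assumptions made for illustration. The source only states that the queue stores and updates high-quality pseudo-labels during batch training.

```python
import random
from collections import defaultdict, deque
from dataclasses import dataclass


@dataclass(frozen=True)
class PseudoLabel:
    """A hypothetical pseudo-label record (field names are assumptions)."""
    image_id: str
    box: tuple       # (x1, y1, x2, y2) in pixels
    category: str    # novel category predicted by the external teacher
    score: float     # teacher confidence in [0, 1]


class DynamicLabelQueue:
    """Illustrative queue: keeps recent, confident pseudo-labels per category."""

    def __init__(self, max_per_category=4, min_score=0.8):
        self.min_score = min_score
        # A bounded deque drops its oldest entry once maxlen is reached.
        self._queues = defaultdict(lambda: deque(maxlen=max_per_category))

    def push(self, label):
        """Store the label if it is confident enough; report whether it was kept."""
        if label.score < self.min_score:
            return False
        self._queues[label.category].append(label)
        return True

    def sample(self, k, rng=None):
        """Draw up to k stored pseudo-labels to mix into a training batch."""
        rng = rng or random.Random()
        pool = [lab for q in self._queues.values() for lab in q]
        return rng.sample(pool, min(k, len(pool)))

    def __len__(self):
        return sum(len(q) for q in self._queues.values())
```

In a hybrid training loop of the kind the summary describes, each step would push newly generated pseudo-labels into the queue and then call `sample` to add some of them to the labeled and unlabeled data in the batch. The threshold and queue size here are placeholder values, not settings from the paper.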