Abstract: An increasingly massive number of remote-sensing images spurs the development of extensible object detectors that can detect objects beyond the training categories without the cost of collecting new labeled data. In this paper, we aim to develop an open-vocabulary object detection (OVD) technique for aerial images that scales the object vocabulary beyond the training data. The performance of OVD relies heavily on the quality of class-agnostic region proposals and of pseudo-labels for novel object categories. To generate high-quality proposals and pseudo-labels simultaneously, we propose CastDet, a CLIP-activated student-teacher open-vocabulary object detection framework. Our end-to-end framework, which follows the student-teacher self-learning mechanism, employs the RemoteCLIP model as an extra omniscient teacher with rich knowledge. By doing so, our approach boosts not only novel object proposals but also classification. Furthermore, we devise a dynamic label queue strategy to maintain high-quality pseudo-labels during batch training. We conduct extensive experiments on multiple existing aerial object detection datasets, which are set up for the OVD task. Experimental results demonstrate that CastDet achieves superior open-vocabulary detection performance, e.g., reaching 46.5% mAP on VisDroneZSD novel categories, which outperforms the state-of-the-art open-vocabulary detectors by 21.0% mAP. To the best of our knowledge, this is the first work to apply and develop the open-vocabulary object detection technique for aerial images. The code is available at https://github.com/lizzy8587/CastDet.

What problem does this paper attempt to address?

### Problems the Paper Attempts to Solve

This paper addresses Open Vocabulary Detection (OVD) in aerial images. The goal is a technique that can detect objects beyond the categories in the training data without the expense of collecting new annotated data. Current aerial object detectors perform well on the categories they were trained on but fail to recognize categories not seen during training, which limits their applicability in open scenarios.

### Background and Challenges

1. **Dataset Scale and Category Vocabulary**: Existing aerial datasets are smaller in scale and category vocabulary than natural image datasets, which limits the detector's scalability.
2. **Background Interference**: Objects in aerial images often resemble the background, making it difficult for detectors to distinguish target objects from background noise.
3. **High Annotation Cost**: Collecting and annotating large-scale aerial images is very costly and requires specialized knowledge.

### Solution

To overcome these challenges, the paper proposes **CastDet**, an open vocabulary object detection method based on a CLIP-activated student-teacher learning framework. The specific contributions are as follows:

1. **Multi-Teacher Self-Learning Mechanism**: CastDet includes a student model and two teacher models. The student model is the detector being trained, and it is guided by a localization teacher model and an external teacher model. The localization teacher mainly discovers and locates potential objects, while the external teacher classifies new categories and generates pseudo-labels.
2. **Dynamic Label Queue**: A dynamic label queue stores and updates high-quality pseudo-labels, maintaining label quality during batch training.
3. **Hybrid Training Strategy**: Training combines labeled data, unlabeled data, and data from the dynamic label queue, gradually expanding the detector's category vocabulary.

### Experimental Results

The paper conducts extensive experiments on multiple existing aerial object detection datasets, set up for the OVD task. On the novel categories of the VisDroneZSD dataset, CastDet reaches 46.5% mAP, which is 21.0% mAP higher than the state-of-the-art open vocabulary detectors it is compared against.

### Conclusion

This paper is the first to apply open vocabulary object detection to aerial images. With the CastDet framework, it addresses the challenge of detecting objects from new categories in aerial images. The method improves both detection accuracy and recall while reducing the reliance on additional annotated data.
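The dynamic label queue from the Solution section can be illustrated with a small sketch. The summary only states that the queue stores and updates high-quality pseudo-labels, so everything concrete here is an assumption made for illustration and not the CastDet implementation: the names `PseudoLabel` and `DynamicLabelQueue`, the confidence threshold `min_score`, the per-category capacity `max_per_category`, and the first-in-first-out eviction rule are all hypothetical.

```python
from collections import deque
from dataclasses import dataclass


@dataclass(frozen=True)
class PseudoLabel:
    image_id: str   # hypothetical identifier of an unlabeled image
    box: tuple      # (x1, y1, x2, y2), e.g. from the localization teacher
    category: str   # novel category, e.g. assigned by the external teacher
    score: float    # classification confidence in [0, 1]


class DynamicLabelQueue:
    """Per-category bounded FIFO of confident pseudo-labels (illustrative only)."""

    def __init__(self, max_per_category=3, min_score=0.8):
        # Both parameters are assumptions; the source gives no concrete values.
        self.max_per_category = max_per_category
        self.min_score = min_score
        self._queues = {}

    def push(self, label):
        """Store the label if it is confident enough; return True if stored."""
        if label.score < self.min_score:
            return False
        queue = self._queues.setdefault(
            label.category, deque(maxlen=self.max_per_category)
        )
        # A deque with maxlen drops its oldest entry when it is full.
        queue.append(label)
        return True

    def sample(self, category):
        """Return the stored labels for one category, oldest first."""
        return list(self._queues.get(category, ()))

    def __len__(self):
        return sum(len(q) for q in self._queues.values())
```

In a hybrid training loop of the kind the summary describes, each batch could push newly generated pseudo-labels into such a queue and then draw stored labels for novel categories alongside the labeled and unlabeled data. Capping the queue per category is one simple way to keep frequent categories from crowding out rare ones; the paper may use a different policy.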

Toward Open Vocabulary Aerial Object Detection with CLIP-Activated Student-Teacher Learning

Exploiting Unlabeled Data with Multiple Expert Teachers for Open Vocabulary Aerial Object Detection and Its Orientation Adaptation

OVA-DETR: Open Vocabulary Aerial Object Detection Using Image-Text Alignment and Fusion

DST-Det: Simple Dynamic Self-Training for Open-Vocabulary Object Detection

What Makes Good Open-Vocabulary Detector: A Disassembling Perspective

Open-Vocabulary Object Detection with an Open Corpus

Open-Vocabulary Point-Cloud Object Detection Without 3D Annotation

Open-vocabulary Object Detection via Vision and Language Knowledge Distillation

Open-Vocabulary 3D Detection via Image-level Class and Debiased Cross-modal Contrastive Learning

OpenSight: A Simple Open-Vocabulary Framework for LiDAR-Based Object Detection

Open-Vocabulary Camouflaged Object Segmentation

Object-Aware Distillation Pyramid for Open-Vocabulary Object Detection

DetCLIPv3: Towards Versatile Generative Open-vocabulary Object Detection

OV-DAR: Open-Vocabulary Object Detection and Attributes Recognition

From Open Vocabulary to Open World: Teaching Vision Language Models to Detect Novel Objects

OvarNet: Towards Open-Vocabulary Object Attribute Recognition

LOVD: Large-and-Open Vocabulary Object Detection

Learning Object-Language Alignments for Open-Vocabulary Object Detection

DetCLIPv2: Scalable Open-Vocabulary Object Detection Pre-training via Word-Region Alignment

Global-Local Collaborative Inference with LLM for Lidar-Based Open-Vocabulary Detection