OVA-DETR: Open Vocabulary Aerial Object Detection Using Image-Text Alignment and Fusion

Guoting Wei,Xia Yuan,Yu Liu,Zhenhao Shang,Kelu Yao,Chao Li,Qingsen Yan,Chunxia Zhao,Haokui Zhang,Rong Xiao

2024-08-22

Abstract: Aerial object detection has been a hot topic for many years due to its wide range of applications. However, most existing approaches can only handle predefined categories, which limits their applicability to real-world open scenarios. In this paper, we extend aerial object detection to open scenarios by exploiting the relationship between image and text, and propose OVA-DETR, a high-efficiency open-vocabulary detector for aerial images. Specifically, based on the idea of image-text alignment, we propose a region-text contrastive loss to replace the category regression loss in the traditional detection framework, which breaks the category limitation. We then propose Bidirectional Vision-Language Fusion (Bi-VLF), which includes a dual-attention fusion encoder and a multi-level text-guided fusion decoder. The dual-attention fusion encoder enhances feature extraction in the encoder. The multi-level text-guided fusion decoder is designed to improve detection of small objects, which appear frequently in aerial scenes. Experimental results on three widely used benchmark datasets show that the proposed method significantly improves mAP and recall while also running faster at inference. For instance, in zero-shot detection experiments on DIOR, OVA-DETR outperforms DescReg and YOLO-World by 37.4% and 33.1%, respectively, while achieving an inference speed of 87 FPS, which is 7.9x faster than DescReg and 3x faster than YOLO-World. The code is available at https://github.com/GT-Wei/OVA-DETR.

Computer Vision and Pattern Recognition

What problem does this paper attempt to address?

### Problems the Paper Attempts to Solve

This paper addresses open-vocabulary object detection in aerial images. Specifically:

1. **Breaking the Predefined Category Limitation**:
   - Most existing aerial object detection methods can only handle the categories predefined in the training set, which limits their use in real-world open scenarios. Building on image-text alignment, the paper introduces a region-text contrastive loss that replaces the traditional classification loss in the detection framework, thereby breaking the category limitation.
2. **Improving Small Object Detection Performance**:
   - Small objects appear frequently in aerial images and are hard to detect. The paper proposes Bidirectional Vision-Language Fusion (Bi-VLF), consisting of a Dual-Attention Fusion Encoder (DAFE) and a Multi-level Text-guided Fusion Decoder (MTFD), to improve small-object detection.
3. **Enhancing Detection Efficiency**:
   - Existing open-vocabulary detection methods often rely on complex architectures and neglect efficiency. The proposed OVA-DETR builds on the lightweight RT-DETR framework and enhances feature extraction through cross-modal fusion, achieving fast inference.

With these improvements, OVA-DETR shows significant gains in zero-shot detection across multiple benchmark datasets and achieves state-of-the-art performance in traditional aerial object detection tasks.
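To make the image-text alignment idea more concrete, here is a minimal sketch of a region-text contrastive loss. It is an illustration under stated assumptions, not the paper's implementation: it assumes a softmax cross-entropy over cosine similarities (the exact formulation in OVA-DETR may differ), and the function names, the temperature value, and the toy 2-D embeddings are all hypothetical.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]

def region_text_logits(region_feats, text_embeds, temperature=0.07):
    """Temperature-scaled cosine similarity of every region to every text prompt."""
    regions = [l2_normalize(r) for r in region_feats]
    texts = [l2_normalize(t) for t in text_embeds]
    return [
        [sum(a * b for a, b in zip(r, t)) / temperature for t in texts]
        for r in regions
    ]

def region_text_contrastive_loss(region_feats, text_embeds, targets, temperature=0.07):
    """Mean cross-entropy pulling each region toward its matching text prompt.

    targets[i] is the index of the text prompt that describes region i.
    Assumption: softmax cross-entropy, chosen here only for illustration.
    """
    logits = region_text_logits(region_feats, text_embeds, temperature)
    total = 0.0
    for row, target in zip(logits, targets):
        peak = max(row)  # subtract the max for numerical stability
        log_sum_exp = peak + math.log(sum(math.exp(x - peak) for x in row))
        total += log_sum_exp - row[target]
    return total / len(targets)

# Hypothetical text embeddings for the prompts "airplane" and "ship".
text_embeds = [[1.0, 0.0], [0.0, 1.0]]
# Hypothetical features for two detected regions.
region_feats = [[0.9, 0.1], [0.2, 0.8]]

loss_matched = region_text_contrastive_loss(region_feats, text_embeds, [0, 1])
loss_swapped = region_text_contrastive_loss(region_feats, text_embeds, [1, 0])

# At inference, each region takes the label of its most similar prompt.
logits = region_text_logits(region_feats, text_embeds)
predictions = [row.index(max(row)) for row in logits]
```

Because the class set enters only through `text_embeds`, adding a new category at inference time amounts to appending one more text embedding; no classifier layer with a fixed number of outputs has to be retrained. This is the sense in which replacing a fixed-category classification loss with region-text alignment removes the category limitation.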

Multilevel Spatial-Temporal Feature Aggregation for Video Object Detection

OVLW-DETR: Open-Vocabulary Light-Weighted Detection Transformer

Toward Open Vocabulary Aerial Object Detection with CLIP-Activated Student-Teacher Learning

OV-DQUO: Open-Vocabulary DETR with Denoising Text Query Training and Open-World Unknown Objects Supervision

What Makes Good Open-Vocabulary Detector: A Disassembling Perspective

OV-DINO: Unified Open-Vocabulary Detection with Language-Aware Selective Fusion

Real-time Transformer-based Open-Vocabulary Detection with Efficient Fusion Head

Object Detection for UAV Aerial Scenarios Based on Vectorized IOU

Open-Vocabulary 3D Detection via Image-level Class and Debiased Cross-modal Contrastive Learning

Open-vocabulary Object Detection via Vision and Language Knowledge Distillation

Open-Vocabulary Object Detection with an Open Corpus

DST-Det: Simple Dynamic Self-Training for Open-Vocabulary Object Detection

Unlocking Textual and Visual Wisdom: Open-Vocabulary 3D Object Detection Enhanced by Comprehensive Guidance from Text and Image

Learning Object-Language Alignments for Open-Vocabulary Object Detection

Exploiting Unlabeled Data with Multiple Expert Teachers for Open Vocabulary Aerial Object Detection and Its Orientation Adaptation

Open-Vocabulary Point-Cloud Object Detection Without 3D Annotation

Distilling DETR with Visual-Linguistic Knowledge for Open-Vocabulary Object Detection

AYDIV: Adaptable Yielding 3D Object Detection via Integrated Contextual Vision Transformer

YOLO-World: Real-Time Open-Vocabulary Object Detection