OVA-DETR: Open Vocabulary Aerial Object Detection Using Image-Text Alignment and Fusion

Guoting Wei, Xia Yuan, Yu Liu, Zhenhao Shang, Kelu Yao, Chao Li, Qingsen Yan, Chunxia Zhao, Haokui Zhang, Rong Xiao
2024-08-22
Abstract: Aerial object detection has been a hot topic for many years due to its wide application requirements. However, most existing approaches can only handle predefined categories, which limits their applicability in real-world open scenarios. In this paper, we extend aerial object detection to open scenarios by exploiting the relationship between image and text, and propose OVA-DETR, a high-efficiency open-vocabulary detector for aerial images. Specifically, based on the idea of image-text alignment, we propose a region-text contrastive loss to replace the category regression loss in the traditional detection framework, which breaks the category limitation. Then, we propose Bidirectional Vision-Language Fusion (Bi-VLF), which includes a dual-attention fusion encoder and a multi-level text-guided fusion decoder. The dual-attention fusion encoder enhances the feature extraction process in the encoder part. The multi-level text-guided fusion decoder is designed to improve the detection ability for small objects, which frequently appear in aerial object detection scenarios. Experimental results on three widely used benchmark datasets show that our proposed method significantly improves the mAP and recall, while enjoying faster inference speed. For instance, in zero-shot detection experiments on DIOR, the proposed OVA-DETR outperforms DescReg and YOLO-World by 37.4% and 33.1%, respectively, while achieving 87 FPS inference speed, which is 7.9x faster than DescReg and 3x faster than YOLO-World. The code is available at https://github.com/GT-Wei/OVA-DETR.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
### Problems the Paper Attempts to Solve

This paper addresses open-vocabulary object detection in aerial images. Specifically:

1. **Breaking the Predefined Category Limitation**:
   - Most existing aerial object detection methods can only handle categories predefined in the training set, which limits their application in real-world open scenarios. The paper builds on image-text alignment and introduces a region-text contrastive loss that replaces the category regression loss of the traditional detection framework, thereby breaking the category limitation.
2. **Improving Small Object Detection Performance**:
   - Small-scale objects frequently appear in aerial images and are difficult to detect. The paper proposes Bidirectional Vision-Language Fusion (Bi-VLF), which comprises a Dual Attention Fusion Encoder (DAFE) and a Multi-level Text-guided Fusion Decoder (MTFD), to improve the detection of small objects.
3. **Enhancing Detection Efficiency**:
   - Existing open-vocabulary detection methods often rely on complex architectures and neglect efficiency. The proposed OVA-DETR is built on the lightweight RT-DETR framework and enhances feature extraction through cross-modal fusion, which keeps inference fast.

With these improvements, OVA-DETR delivers significant gains in zero-shot detection across multiple benchmark datasets and achieves state-of-the-art performance in traditional aerial object detection tasks.
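The region-text contrastive loss described above can be illustrated with a minimal, dependency-free sketch. This is not the authors' implementation: the function names, the toy embeddings, and the temperature value of 0.07 are all assumptions chosen for illustration, and the step that assigns each region to a ground-truth category (for example, the matching used in DETR-style detectors) is omitted. The sketch only shows the general idea of scoring each region embedding against a list of text embeddings and applying softmax cross-entropy, so the set of categories is defined by whatever text embeddings are supplied rather than by a fixed classifier head.

```python
import math


def l2_normalize(vec):
    """Scale a non-zero vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def region_text_contrastive_loss(region_embs, text_embs, labels, temperature=0.07):
    """Mean softmax cross-entropy over region-to-text cosine similarities.

    region_embs: list of region embedding vectors (hypothetical detector outputs)
    text_embs:   list of text embedding vectors, one per category name
    labels:      index into text_embs of the matching category for each region
    temperature: assumed scaling value, not taken from the paper
    """
    texts = [l2_normalize(t) for t in text_embs]
    total = 0.0
    for region, label in zip(region_embs, labels):
        r = l2_normalize(region)
        logits = [sum(a * b for a, b in zip(r, t)) / temperature for t in texts]
        # Numerically stable log-sum-exp for the softmax denominator.
        m = max(logits)
        log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
        total += log_sum_exp - logits[label]
    return total / len(labels)


# Toy example: two categories and two regions, each close to one text embedding.
text_embs = [[1.0, 0.0], [0.0, 1.0]]
region_embs = [[0.9, 0.1], [0.2, 0.8]]

aligned_loss = region_text_contrastive_loss(region_embs, text_embs, [0, 1])
swapped_loss = region_text_contrastive_loss(region_embs, text_embs, [1, 0])
print(aligned_loss < swapped_loss)
```

Because the category list enters only through `text_embs`, adding a new category at inference time amounts to appending one more text embedding, which is the property that removes the fixed-category restriction in this kind of formulation.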