UAV-DETR: Efficient End-to-End Object Detection for Unmanned Aerial Vehicle Imagery

Huaxiang Zhang, Kai Liu, Zhongxue Gan, Guo-Niu Zhu
2025-01-03
Abstract: Unmanned aerial vehicle object detection (UAV-OD) has been widely used in various scenarios. However, most existing UAV-OD algorithms rely on manually designed components, which require extensive tuning. End-to-end models that do not depend on such manually designed components are mainly designed for natural images and are less effective on UAV imagery. To address these challenges, this paper proposes UAV-DETR, an efficient detection transformer (DETR) framework tailored for UAV imagery. The framework includes a multi-scale feature fusion with frequency enhancement module, which captures both spatial and frequency information at different scales. In addition, a frequency-focused down-sampling module is presented to retain critical spatial details during down-sampling. A semantic alignment and calibration module is developed to align and fuse features from different fusion paths. Experimental results demonstrate the effectiveness and generalization of the approach across various UAV imagery datasets. On the VisDrone dataset, the method improves AP by 3.1% and $\text{AP}_{50}$ by 4.2% over the baseline. Similar enhancements are observed on the UAVVaste dataset. Project page: https://github.com/ValiantDiligent/UAV-DETR
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper addresses object detection in unmanned aerial vehicle (UAV) imagery (UAV-OD). Specifically, it targets three problems:

1. **Existing UAV-OD algorithms rely on manually designed components.** Most existing UAV-OD algorithms depend on components such as non-maximum suppression (NMS) and anchor boxes generated from human experience. These components require extensive parameter tuning, which makes the detectors complex and inefficient.
2. **End-to-end models perform poorly on UAV images.** Existing end-to-end detectors such as DETR are designed mainly for natural images and perform poorly on UAV images, especially for small and occluded objects.
3. **UAV images pose specific challenges.** Objects in UAV images are often small or occluded, so traditional object detection methods struggle. A method that extracts multi-scale features more effectively is therefore needed.

To solve these problems, the paper proposes **UAV-DETR**, an efficient detection transformer framework designed specifically for UAV images. The framework introduces three modules:

- **Multi-Scale Feature Fusion with Frequency Enhancement (MSFF-FE)**: combines spatial-domain and frequency-domain information to improve detection of small and occluded objects.
- **Frequency-Focused Down-Sampling (FD)**: preserves key spatial details during down-sampling.
- **Semantic Alignment and Calibration (SAC)**: aligns and fuses features from different fusion paths to improve detection performance.
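To make "frequency-domain information" concrete, the following is a minimal sketch of splitting a single 2-D feature map into low- and high-frequency parts with an FFT mask. It is an illustration only, not the authors' MSFF-FE or FD module: the function name `split_frequency_bands`, the `cutoff` parameter, and the ideal circular low-pass mask are all assumptions introduced here.

```python
import numpy as np


def split_frequency_bands(feature, cutoff=0.25):
    """Split a 2-D map into low- and high-frequency parts (illustrative only).

    `cutoff` is a hypothetical parameter: the radius of the low-pass region
    in cycles per sample (the full range along each axis is [-0.5, 0.5)).
    """
    h, w = feature.shape
    # Centered 2-D spectrum of the feature map.
    spectrum = np.fft.fftshift(np.fft.fft2(feature))
    # Frequency coordinates, shifted the same way as the spectrum.
    fy = np.fft.fftshift(np.fft.fftfreq(h))[:, None]
    fx = np.fft.fftshift(np.fft.fftfreq(w))[None, :]
    low_mask = np.sqrt(fx ** 2 + fy ** 2) <= cutoff
    # Keep only low frequencies, then return to the spatial domain.
    low = np.fft.ifft2(np.fft.ifftshift(spectrum * low_mask)).real
    # The residual carries edges and fine texture.
    high = feature - low
    return low, high


rng = np.random.default_rng(0)
feat = rng.standard_normal((32, 32))  # stand-in for one feature channel
low, high = split_frequency_bands(feat)
```

The high-frequency residual is where edges and fine texture live, which is one plausible reason frequency cues could help with small objects. How UAV-DETR actually uses frequency information is described in the paper and its repository, not here.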
With these modules, UAV-DETR improves Average Precision (AP) and AP50 on the VisDrone and UAVVaste datasets; on VisDrone, AP rises by 3.1% and AP50 by 4.2% over the baseline. The model also supports real-time inference, making it suitable for practical UAV object detection.
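For readers unfamiliar with the metric, AP50 is commonly read as average precision where a detection counts as correct if its intersection over union (IoU) with a ground-truth box is at least 0.5. The sketch below assumes that COCO-style convention, which this summary does not itself define. It computes a simplified, non-interpolated AP for one class on one image; the function names and the toy boxes are hypothetical, and benchmark tooling uses interpolated precision over a whole dataset.

```python
def iou(box_a, box_b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def average_precision(detections, gt_boxes, iou_thr=0.5):
    """Simplified AP: mean of the precision at each true positive.

    detections: list of (score, box); gt_boxes: list of boxes.
    Each ground-truth box can be matched by at most one detection.
    """
    if not gt_boxes:
        return 0.0
    ranked = sorted(detections, key=lambda d: d[0], reverse=True)
    matched = set()
    true_positives = 0
    precision_sum = 0.0
    for rank, (_score, box) in enumerate(ranked, start=1):
        # Find the best still-unmatched ground-truth box for this detection.
        best_iou, best_j = 0.0, -1
        for j, gt in enumerate(gt_boxes):
            if j in matched:
                continue
            overlap = iou(box, gt)
            if overlap > best_iou:
                best_iou, best_j = overlap, j
        if best_j >= 0 and best_iou >= iou_thr:
            matched.add(best_j)
            true_positives += 1
            precision_sum += true_positives / rank
    return precision_sum / len(gt_boxes)


# Toy example: two ground-truth boxes, three scored detections.
gts = [(0, 0, 10, 10), (20, 20, 30, 30)]
dets = [
    (0.9, (0, 0, 10, 10)),    # exact match
    (0.8, (50, 50, 60, 60)),  # no overlap with any ground truth
    (0.7, (20, 20, 30, 28)),  # IoU 0.8 with the second ground truth
]
ap50 = average_precision(dets, gts, iou_thr=0.5)
ap90 = average_precision(dets, gts, iou_thr=0.9)
```

Raising the threshold from 0.5 to 0.9 turns the third detection into a false positive, so the stricter score is lower. The plain AP figure in COCO-style evaluation averages over a range of such thresholds, which is why it is typically lower than AP50.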