Abstract: Large differences in target scale across remote sensing images lead to marked variations in visual features, posing significant challenges for feature extraction, fusion, regression, and classification. For example, models frequently struggle to capture features of targets across all scales, inadequately weight the importance of features at different scales during fusion, and encounter accuracy limitations when detecting targets of varying scales. To tackle these challenges, we propose a Scale-Robust Feature Aggregation and Diffusion Network (SRFAD-Net) for remote sensing target detection. The model comprises a Scale-Robust Feature Network (SRFN), an Adaptive Feature Aggregation and Diffusion (AFAD) module, and a Focaler-GIoU Loss. SRFN extracts scale-robust features by constructing a multi-scale pyramid. It includes a downsampling (ADown) module that combines the advantages of average pooling and max pooling, preserving both background information and salient features, which further enhances the network's ability to handle targets of varying scales and shapes. The introduced Deformable Attention (DAttention) mechanism captures target features by adaptively adjusting the shape and size of the receptive field, reducing background clutter and substantially enhancing the model's performance in detecting distant objects. In the feature fusion stage, we propose the AFAD module, which uses a dimension-adaptive perceptual selection mechanism and parallel depthwise convolutions to precisely aggregate multi-channel information. It then employs a diffusion mechanism to spread contextual information across scales, greatly improving the network's ability to extract and fuse multi-scale features.
For the detection head, we adopt the Focaler-GIoU Loss, leveraging its advantages in handling non-overlapping bounding boxes to alleviate the localization difficulty caused by scale variations. We conduct experiments on two widely used aerial target datasets: the Remote Sensing Scene Object Detection Dataset (RSOD) and NWPU VHR-10, a high-resolution object detection dataset from Northwestern Polytechnical University. The experimental results show that SRFAD-Net outperforms mainstream detectors.
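To make the role of the Focaler-GIoU Loss concrete, the following is a minimal sketch for a single pair of axis-aligned boxes. The abstract does not give the formula, so the sketch assumes the commonly cited Focaler-IoU formulation, in which IoU is linearly remapped onto an interval [d, u] and the loss is L_GIoU + IoU - IoU_focaler. The function names, the (x1, y1, x2, y2) box format, and the default values of d and u are illustrative assumptions, not details taken from SRFAD-Net.

```python
def box_area(box):
    """Area of an axis-aligned box given as (x1, y1, x2, y2)."""
    x1, y1, x2, y2 = box
    return max(0.0, x2 - x1) * max(0.0, y2 - y1)


def iou_and_giou(a, b):
    """Return (IoU, GIoU) for two boxes in (x1, y1, x2, y2) format."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = box_area(a) + box_area(b) - inter
    iou = inter / union if union > 0 else 0.0

    # Smallest box enclosing both inputs; GIoU penalizes its unused area.
    enclose = (min(a[0], b[0]), min(a[1], b[1]),
               max(a[2], b[2]), max(a[3], b[3]))
    c_area = box_area(enclose)
    giou = iou - (c_area - union) / c_area if c_area > 0 else iou
    return iou, giou


def focaler_giou_loss(pred, target, d=0.0, u=0.95):
    """Assumed form: L_GIoU + IoU - IoU_focaler, with d < u (illustrative defaults)."""
    iou, giou = iou_and_giou(pred, target)
    # Linear remapping of IoU onto [d, u], clipped to [0, 1].
    iou_focaler = min(1.0, max(0.0, (iou - d) / (u - d)))
    return (1.0 - giou) + iou - iou_focaler


if __name__ == "__main__":
    target = (0.0, 0.0, 1.0, 1.0)
    print(focaler_giou_loss((0.0, 0.0, 1.0, 1.0), target))  # perfect match
    print(focaler_giou_loss((2.0, 0.0, 3.0, 1.0), target))  # disjoint, near
    print(focaler_giou_loss((4.0, 0.0, 5.0, 1.0), target))  # disjoint, far
```

Under these assumptions, two disjoint boxes both have an IoU of zero, yet the farther one receives the larger loss because the GIoU term grows with the empty area of the enclosing box. This is the property the abstract refers to when it mentions handling non-overlapping bounding boxes.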

Adaptive Scale and Spatial Aggregation for Real-Time Object Detection

Multilevel Spatial-Temporal Feature Aggregation for Video Object Detection

A Novel Adaptive Edge Aggregation and Multiscale Feature Interaction Detector for Object Detection in Remote Sensing Images

ASFD: Automatic and Scalable Face Detector

Temporal-adaptive sparse feature aggregation for video object detection

Multi-view Aggregation for Real-Time Accurate Object Detection of a Moving Camera

Spatial Information Enhancement with Multi-Scale Feature Aggregation for Long-Range Object and Small Reflective Area Object Detection from Point Cloud

Adaptive Feature Aggregation for Video Object Detection

Spatial-Temporal Feature Aggregation Network for Video Object Detection

Efficient Feature Aggregation and Scale-Aware Regression for Monocular 3D Object Detection

Towards Better Object Detection in Scale Variation with Adaptive Feature Selection

An Adaptive Attention Fusion Mechanism Convolutional Network for Object Detection in Remote Sensing Images

Efficient object detector via dynamic prior and dynamic feature fusion

DFA: Dynamic Feature Aggregation for Efficient Video Object Detection

AdaScale: Towards Real-time Video Object Detection Using Adaptive Scaling

Multi-Scale Interactive Network for Salient Object Detection

A Task-Balanced Multiscale Adaptive Fusion Network for Object Detection in Remote Sensing Images

Real-Time and Accurate Object Detection in Compressed Video by Long Short-term Feature Aggregation

SRFAD-Net: Scale-Robust Feature Aggregation and Diffusion Network for Object Detection in Remote Sensing Images

Accelerating real‐time object detection in high‐resolution video surveillance

Object Detection With Extended Attention And Spatial Information