Abstract: Large differences in target scale across remote sensing images cause marked variations in visual features, which poses challenges for feature extraction, fusion, regression, and classification. For example, models frequently struggle to capture features of targets at all scales, give inadequate consideration to the weights and importance of features at different scales during fusion, and run into accuracy limits when detecting targets of varying scales. To tackle these challenges, we propose a Scale-Robust Feature Aggregation and Diffusion Network (SRFAD-Net) for remote sensing target detection. The model comprises a Scale-Robust Feature Network (SRFN), an Adaptive Feature Aggregation and Diffusion (AFAD) module, and a Focaler-GIoU Loss. SRFN extracts scale-robust features by constructing a multi-scale pyramid. It includes a downsampling (ADown) module that combines the advantages of average pooling and max pooling, preserving both background information and salient features, which further enhances the network's ability to handle targets of varying scales and shapes. The introduced Deformable Attention (DAttention) mechanism captures target features by adaptively adjusting the shape and size of the receptive field, reducing background clutter and substantially improving the detection of distant objects. In the feature fusion stage, we propose the AFAD module, which uses a dimension-adaptive perceptual selection mechanism and parallel depthwise convolutions to precisely aggregate multi-channel information. It then employs a diffusion mechanism to spread contextual information across scales, greatly improving the network's ability to extract and fuse multi-scale features. For the detection head, we adopt the Focaler-GIoU Loss, whose handling of non-overlapping bounding boxes alleviates the localization difficulty caused by scale variations. We conducted experiments on two widely used aerial target datasets: the Remote Sensing Scene Object Detection Dataset (RSOD) and NWPU VHR-10, a high-resolution object detection dataset from Northwestern Polytechnical University. The results show that SRFAD-Net outperforms mainstream detectors.
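The idea of pairing average pooling with max pooling during downsampling can be illustrated with a toy sketch. The abstract does not give the ADown layer layout, so everything here is an assumption for illustration: the function names `pool2x2` and `dual_pool_downsample` are hypothetical, the window is a fixed non-overlapping 2x2, and any learned convolutions or channel splitting that the actual module may contain are omitted.

```python
def pool2x2(fmap, reduce_fn):
    """Downsample a 2D map by 2 using non-overlapping 2x2 windows."""
    height, width = len(fmap), len(fmap[0])
    out = []
    for i in range(0, height - 1, 2):
        row = []
        for j in range(0, width - 1, 2):
            window = [fmap[i][j], fmap[i][j + 1],
                      fmap[i + 1][j], fmap[i + 1][j + 1]]
            row.append(reduce_fn(window))
        out.append(row)
    return out


def mean(values):
    return sum(values) / len(values)


def dual_pool_downsample(fmap):
    """Return [avg_branch, max_branch] as two output channels.

    Hypothetical simplification: the average branch summarizes the whole
    window (background context), the max branch keeps the strongest
    response (salient feature). A real module would mix these with
    learned layers, which are not modeled here.
    """
    return [pool2x2(fmap, mean), pool2x2(fmap, max)]


if __name__ == "__main__":
    feature_map = [
        [1, 2, 0, 0],
        [3, 4, 0, 8],
        [0, 0, 1, 1],
        [0, 0, 1, 1],
    ]
    avg_branch, max_branch = dual_pool_downsample(feature_map)
    print(avg_branch)
    print(max_branch)
```

In the top-right window, a single strong activation of 8 on an otherwise empty background is smoothed to 2.0 by the average branch but kept intact by the max branch. Keeping both branches retains both views of the same region.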

### What problem does this paper attempt to solve?

This paper primarily addresses the challenges posed by significant differences in target scales in remote sensing image object detection. Specifically:

1. **Feature Extraction**: The significant differences in target scales in remote sensing images lead to substantial variations in visual features such as texture, shape, and edge information. Traditional feature extraction methods struggle to capture target features at different scales, and existing multi-scale pyramid construction methods tend to dilute or lose information when dealing with small targets, while large targets exhibit complexity and diversity.
2. **Feature Fusion**: Features at different scales exhibit significant differences in expression forms, resolution, and semantic information. Traditional simple fusion methods (such as stacking or concatenation) cannot effectively utilize the information from these features. Moreover, most methods perform feature fusion at only one level, failing to fully leverage the semantic information and resolution at different levels.
3. **Regression and Classification**: When there are significant differences in target scales, regression algorithms struggle to accurately adapt to targets of different scales, leading to discrepancies between predicted positions and sizes and the actual targets. Classifiers also find it challenging to accurately identify targets of different scales, thereby reducing classification performance.

To address the above issues, the paper proposes a new network architecture, the Scale-Robust Feature Aggregation and Diffusion Network (SRFAD-Net), which includes the following key components:

- **Scale-Robust Feature Network (SRFN)**: Utilizes a special ADown module that combines the advantages of average pooling and max pooling to highlight important features while retaining background information. It also introduces a DAttention mechanism to dynamically adjust sampling positions and attention weights, enhancing the ability to capture distant targets.
- **Adaptive Feature Aggregation and Diffusion (AFAD) Module**: Precisely selects and integrates multi-channel information through a channel-aware selection mechanism and propagates features containing rich contextual information to different scales through a unique diffusion mechanism, enhancing feature fusion capabilities.
- **Focaler-GIoU Loss**: Combines the advantages of Focaler-IoU, which emphasizes difficult samples, and GIoU, which handles non-overlapping bounding boxes, to improve the accuracy of target position and shape description.

Through these designs, SRFAD-Net effectively addresses the issues of feature extraction, fusion, regression, and classification in remote sensing image object detection, enhancing detection accuracy and generalization capability.
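The text above does not state the Focaler-GIoU formula, so the following is a minimal sketch built on assumptions. It combines the standard GIoU loss with the linear IoU remapping commonly associated with Focaler-IoU, in the form `L = (1 - GIoU) + IoU - IoU_focaler`. The interval bounds `d` and `u`, their default values, and all function names are illustrative choices, and the paper's exact variant may differ.

```python
def box_area(box):
    """Area of an axis-aligned box given as (x1, y1, x2, y2)."""
    return max(0.0, box[2] - box[0]) * max(0.0, box[3] - box[1])


def iou_and_giou(a, b):
    """Return (IoU, GIoU) for two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = box_area(a) + box_area(b) - inter
    # Smallest axis-aligned box that encloses both inputs.
    enclose = ((max(a[2], b[2]) - min(a[0], b[0]))
               * (max(a[3], b[3]) - min(a[1], b[1])))
    iou = inter / union if union > 0 else 0.0
    giou = iou - (enclose - union) / enclose if enclose > 0 else iou
    return iou, giou


def focaler_giou_loss(pred, target, d=0.0, u=0.95):
    """Assumed form: GIoU loss plus the gap between IoU and remapped IoU.

    d and u (hypothetical defaults) bound the IoU interval that is
    linearly stretched to [0, 1]; values outside it are clamped.
    """
    iou, giou = iou_and_giou(pred, target)
    iou_focaler = min(1.0, max(0.0, (iou - d) / (u - d)))
    return (1.0 - giou) + iou - iou_focaler


if __name__ == "__main__":
    target = (0, 0, 1, 1)
    print(focaler_giou_loss((0, 0, 1, 1), target))  # identical boxes
    print(focaler_giou_loss((2, 0, 3, 1), target))  # disjoint, near
    print(focaler_giou_loss((4, 0, 5, 1), target))  # disjoint, far
```

For the two disjoint predictions the IoU is zero in both cases, yet the loss is larger for the box that lies farther away because the enclosing-box term in GIoU grows. That gradient signal for non-overlapping boxes is the property the summary attributes to the GIoU component.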

SRFAD-Net: Scale-Robust Feature Aggregation and Diffusion Network for Object Detection in Remote Sensing Images

ℱ3-Net: Feature Fusion and Filtration Network for Object Detection in Optical Remote Sensing Images

FDLR-Net: A feature decoupling and localization refinement network for object detection in remote sensing images

A Task-Balanced Multiscale Adaptive Fusion Network for Object Detection in Remote Sensing Images

CF2PN: A Cross-Scale Feature Fusion Pyramid Network Based Remote Sensing Target Detection

AFANet: A Multibackbone Compatible Feature Fusion Framework for Effective Remote Sensing Object Detection

SFSANet: Multiscale Object Detection in Remote Sensing Image Based on Semantic Fusion and Scale Adaptability

Exploiting Full-Scale Feature for Remote Sensing Object Detection Based on Refined Feature Mining and Adaptive Fusion

Semantic Information Feature Aggregation Network for Object Detection in Remote Sensing Images

FAFFENet: frequency attention and feature fusion enhancement network for multiscale remote sensing target detection

Object Detection in Remote Sensing Images Based on Adaptive Multi-Scale Feature Fusion Method

MFCANet: Multiscale Feature Context Aggregation Network for Oriented Object Detection in Remote-Sensing Images

An Efficient Feature Pyramid Network for Object Detection in Remote Sensing Imagery

SSN: Scale Selection Network for Multi-Scale Object Detection in Remote Sensing Images

A Multi-Feature Fusion and Attention Network for Multi-Scale Object Detection in Remote Sensing Images

Differentiation of bursal secretory‐dendritic cells studied with anti‐vimentin monoclonal antibody

An Adaptive Attention Fusion Mechanism Convolutional Network for Object Detection in Remote Sensing Images

Remote Sensing Small Object Detection Network Based on Attention Mechanism and Multi-Scale Feature Fusion

SGMFNet: a remote sensing image object detection network based on spatial global attention and multi-scale feature fusion

A Novel Adaptive Edge Aggregation and Multiscale Feature Interaction Detector for Object Detection in Remote Sensing Images