Abstract: Large differences in target scale across remote sensing images produce marked variations in visual features, which complicates feature extraction, fusion, regression, and classification. For example, models often fail to capture features of targets at all scales, give inadequate consideration to the relative importance of features at different scales during fusion, and suffer accuracy limitations when detecting targets of varying scales. To tackle these challenges, we propose a Scale-Robust Feature Aggregation and Diffusion Network (SRFAD-Net) for remote sensing target detection. The model comprises a Scale-Robust Feature Network (SRFN), an Adaptive Feature Aggregation and Diffusion (AFAD) module, and a Focaler-GIoU Loss. SRFN extracts scale-robust features by constructing a multi-scale pyramid. It includes a downsampling (ADown) module that combines the advantages of average pooling and max pooling, preserving both background information and salient features, which further enhances the network's ability to handle targets of varying scales and shapes. The introduced Deformable Attention (DAttention) mechanism captures target features by adaptively adjusting the shape and size of the receptive field, reducing background clutter and substantially improving the detection of distant objects. In the feature fusion stage, we propose the AFAD module, which uses a dimension-adaptive perceptual selection mechanism and parallel depthwise convolutions to aggregate multi-channel information precisely. It then employs a diffusion mechanism to spread contextual information across scales, greatly improving the network's ability to extract and fuse multi-scale features. For the detection head, we adopt the Focaler-GIoU Loss, whose ability to handle non-overlapping bounding boxes alleviates the localization difficulty caused by scale variations. We conduct experiments on two widely used aerial target datasets: the Remote Sensing Scene Object Detection Dataset (RSOD) and NWPU VHR-10, a high-resolution object detection dataset from Northwestern Polytechnical University. The results show that SRFAD-Net outperforms mainstream detectors.
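
The role of the Focaler-GIoU Loss with non-overlapping boxes can be illustrated with a minimal sketch. The abstract does not give the formula, so the code below assumes the commonly published form, in which the GIoU loss is offset by the difference between the plain IoU and a linearly rescaled IoU clipped to [0, 1]. The function names and the thresholds `d` and `u` are illustrative assumptions, not values taken from this work.

```python
def iou_and_giou(box_a, box_b):
    """Return (IoU, GIoU) for two boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b

    # Intersection area (zero when the boxes do not overlap).
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h

    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - inter
    iou = inter / union

    # Smallest axis-aligned box enclosing both inputs.
    enclose_w = max(ax2, bx2) - min(ax1, bx1)
    enclose_h = max(ay2, by2) - min(ay1, by1)
    enclose = enclose_w * enclose_h

    # GIoU penalizes the empty part of the enclosing box.
    giou = iou - (enclose - union) / enclose
    return iou, giou


def focaler_giou_loss(box_a, box_b, d=0.0, u=0.95):
    """Assumed form: L_GIoU + IoU - IoU_focaler, with illustrative d and u."""
    iou, giou = iou_and_giou(box_a, box_b)
    # Linear rescaling of IoU onto [0, 1] between thresholds d and u.
    iou_focaler = min(1.0, max(0.0, (iou - d) / (u - d)))
    return (1.0 - giou) + iou - iou_focaler


if __name__ == "__main__":
    pred = (0.0, 0.0, 1.0, 1.0)
    print(focaler_giou_loss(pred, (0.0, 0.0, 1.0, 1.0)))  # identical boxes
    print(focaler_giou_loss(pred, (2.0, 0.0, 3.0, 1.0)))  # disjoint, near
    print(focaler_giou_loss(pred, (4.0, 0.0, 5.0, 1.0)))  # disjoint, far
```

Under these assumptions, two disjoint boxes both have an IoU of zero, yet the loss is larger for the more distant pair, so the regression still receives a useful signal when a predicted box misses the target entirely.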

DSFNet: Video Salient Object Detection Using a Novel Lightweight Deformable Separable Fusion Network

Densely Deformable Efficient Salient Object Detection Network

DSFNet: Dynamic and Static Fusion Network for Moving Object Detection in Satellite Videos

AMDFNet: Adaptive multi-level deformable fusion network for RGB-D saliency detection

SLMSF-Net: A Semantic Localization and Multi-Scale Fusion Network for RGB-D Salient Object Detection

Full-duplex strategy for video object segmentation

Spatial attention-guided deformable fusion network for salient object detection

Video Salient Object Detection via Fully Convolutional Networks

Hierarchical Dynamic Filtering Network for RGB-D Salient Object Detection

HFMDNet: Hierarchical Fusion and Multilevel Decoder Network for RGB-D Salient Object Detection

AWANet: Attentive-Aware Wide-Kernels Asymmetrical Network with Blended Contour Information for Salient Object Detection

Middle-level Fusion for Lightweight RGB-D Salient Object Detection

A Single Stream Network for Robust and Real-Time RGB-D Salient Object Detection

An adaptive guidance fusion network for RGB-D salient object detection

SRFAD-Net: Scale-Robust Feature Aggregation and Diffusion Network for Object Detection in Remote Sensing Images

Middle-Level Feature Fusion for Lightweight RGB-D Salient Object Detection

A Saliency Enhanced Feature Fusion based multiscale RGB-D Salient Object Detection Network

PSNet: Parallel Symmetric Network for Video Salient Object Detection

SCFANet: Semantics and Context Feature Aggregation Network for 360° Salient Object Detection

Motion-Aware Memory Network for Fast Video Salient Object Detection

Lightweight Salient Object Detection in Optical Remote-Sensing Images Via Semantic Matching and Edge Alignment